Optimizing Apache Pinot's Query Performance

Captions
Hey everyone, thank you all for joining another Apache Pinot meetup. Today's topic is optimizing Apache Pinot's query performance, and we have two presenters: Subbu Subramaniam and Jackie Jiang. Subbu is going first; his topic is optimizing Pinot's real-time performance. Following his presentation we'll do a Q&A, then we'll send him backstage for a bit and bring out Jackie, who is going to do a couple of talks: optimizing query performance (which index to use), plus filter optimization and segment pruning. If you have any questions, please ask right here in the YouTube chat. If you're watching after the live stream and have questions for the speakers, feel free to join the Apache Pinot Slack and ask them there; Subbu and Jackie are both very active in the Apache Pinot Slack.

A quick summary of our first presenter: Subbu has been working on Pinot for the past five years and has designed and implemented much of the real-time infrastructure in Pinot. Prior to joining the Pinot team, Subbu worked on a variety of technologies: key-value stores (Espresso at LinkedIn, PNUTS at Yahoo) and change-capture technology (Databus at LinkedIn). Subbu also has experience with embedded systems, building redundancy and fault tolerance into network switches. Pretty awesome background, so let me welcome Subbu.

Hi Subbu.

Hey Kareem, thanks for the nice introduction.

I'm really excited about this; I think a lot of people are going to be excited about this topic. For those watching: sometimes there is a slight delay, so if you have questions for Subbu, feel free to ask them in advance in the chat and we'll ask them during the Q&A once he's done with his presentation. All right, I'll let you take it away.

Thanks much, Kareem, and welcome everybody, good morning. Just so you know, I'm missing a talk by Leslie Lamport at LinkedIn in order to give this presentation, so you are obviously more valuable than Leslie Lamport. Jokes aside, I hope you have a good time listening to how real-time Apache Pinot performs, how we can make it better, and what the future plans are. Before I start, an acknowledgement to all my colleagues, past and present, who have contributed to building this knowledge.

We'll go through an overview of Pinot, then some factors that influence the performance of real-time Pinot, then how we can tune Pinot to get those factors under control, and finally some future ideas I have on performance improvement.

Starting with the overview: many of you may already know this, but it's good to start with it nevertheless. Pinot has multiple components: controllers, brokers, and offline and real-time servers. Helix is our cluster-management framework, and it uses ZooKeeper to store metadata, both for Helix and for Pinot. Then there is a segment store, the permanent store in which all the Pinot data is kept. The segment store can be populated via a batch data push,
which goes to the controllers, and the controller stores the data in the segment store. An alternative is to push data directly to the segment store and then tell the controller the location of the data; the controller takes care of it from there. One of the first things that happens when an offline segment is pushed is that the offline servers get notified that they need to host the data, so they pull it down from the segment store and serve it. Similarly, on the real-time side there is an inbound stream, Kafka (and now we support other streams as well), and the servers start consuming it once the table is created. Every once in a while a server does a periodic commit of the real-time segments, which are also kept on the real-time servers for serving purposes. At that point both real-time and offline segments are in the deep store. Queries come straight to the broker; the broker federates them across the real-time and offline servers, gets their responses, and passes the combined response back to the user.

On the real-time front, zooming in a little, the architecture is as follows. You have a partitioned stream like Kafka: imagine partitions p1 through pn, with each box representing messages, and multiple servers consuming those partitions. In the consumers you have a Pinot indexer and a query server. The indexer and the query server are common, but there is a partition consumer instantiated once per partition consumed on the server, so if a single server consumes eight partitions, you have eight instances of the partition consumer on that server. Of course, each partition can be consumed multiple times for redundancy, and we don't place redundant consumers of the same partition on the same server; they go to different servers, so that if one goes down, someone else is still consuming and indexing that partition for queries. That is the layout of the real-time architecture, and queries are routed to cover all partitions, and all segments of all partitions, past and present.

Then we have what's called a lambda architecture. Imagine this is your time progression: real-time data has been consumed for August 2nd, 3rd, 4th, and 5th, the 6th is being consumed right now, and offline data has been pushed from Hadoop or some other offline system for, say, two years, ending August 3rd; let's say you have two-year retention on the table. All the old data is available in the offline part of your table and all the new data in the real-time part, but as a user you see only one table. How do we do this? Say a query comes in: select the sum of a metric from the table. If you send it to both offline and real-time, you're going to get duplicate data for August 2nd and 3rd, right? We avoid this by keeping what we call a time boundary on the broker. The broker maintains that time boundary, and as new data gets pushed to offline, the boundary keeps shifting; right now it sits between the 2nd and the 3rd. When the broker gets a query like that, it sends the offline servers a query along the lines of "select sum(metric) from table where the date is on or before August 2nd"; it adds that where clause, and it adds the complementary where clause to the query it sends to the real-time servers. It then gets the two responses back, consolidates them, and returns the result to the user.
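As an illustrative sketch (the table, column, and date values are made up; the exact boundary predicate Pinot generates depends on the table's time column and push frequency):

  -- what the user sends
  SELECT SUM(metric) FROM myTable

  -- what the broker sends to the offline servers (time boundary: Aug 2)
  SELECT SUM(metric) FROM myTable_OFFLINE WHERE dateCol <= '2020-08-02'

  -- what the broker sends to the real-time servers
  SELECT SUM(metric) FROM myTable_REALTIME WHERE dateCol > '2020-08-02'

The broker merges the two partial results, so the overlapping days are never double-counted.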
Zooming one level further into real-time segments: say partition 4 is being consumed on a real-time server. The last segment is the one currently being consumed (that's the current time), and the others are completed segments; as I said, there is periodic segment completion. Somewhere in there is our time boundary, or rather query boundary. You might have real-time-only tables without a time boundary, but you probably still have a query boundary, in the sense that most of your queries fall between some time and now; maybe 90% of queries cover the last one day, something like that. And there will be some segments beyond that, subject to the retention of your table: you might retain the table for 30 days while most of your queries cover the last two days and some span the full 30 days. That's very plausible, and it's why you want 30-day retention.

So we have three types of segments. There are the completed segments that are cold (dark green, rare reads); what I call the completed-warm segments, which are read-only, no longer being written; and the consuming segment, which is being written to. The consuming segment continuously receives rows from the streaming layer underneath, indexes them as they are ingested, and makes them available for queries. There may be another partition being consumed as well, in fact several; in this case partition 7 is being consumed, with its own segments. If you notice, I have drawn the segment sizes slightly different from each other on purpose: they are not guaranteed to be the same size. They try to adjust to the size you specify, going a little over or under depending on the incoming ingestion rate; they learn as they go, so sizes can differ slightly, and the time boundary (or query boundary) could well fall in the middle of a segment.

The important point to note here is that all the segments are loaded as memory-mapped files. There are some exceptions, which we'll walk through a little later, but the completed, non-consuming segments are files on disk, simply mapped into memory.

OK, factors influencing performance. What do we mean by the performance of real time? I think there are two questions that matter. The first: are rows showing up late in Pinot? If so, something needs to be done. The second: are queries not meeting the SLA? I wouldn't say "queries are slow", because slow or fast depends on what your SLA is. If you have a 10-millisecond SLA, then 12 milliseconds is slow; if you have a 500-millisecond SLA, 12 milliseconds is awesomely fast. These are the two main questions the real-time system has to deal with.
If these two questions come back with "everything is good", then you can probably go get a bagel and coffee and skip the rest of the talk, or just use it for entertainment. A related question: can we keep it going with less hardware? That's worth thinking about too. With these three questions in mind, let's see what factors influence the performance of real-time systems in particular. There are other aspects of Pinot performance as well, but this talk concentrates on real time.

So what are the factors? First, the ingestion rate: the higher the message ingestion rate, the more work the real-time server has to do indexing the messages. That takes CPU, and it can cause slowness in ingestion or in queries, depending on how much juice is left over on the machine. Then the schema width: you can ingest 50,000 messages per second per partition if you have three or five columns; with 5,000 columns you probably cannot, because it takes time to ingest a row, and a row is made available for queries only after it has been ingested in full. Schema width is both bytes per second and number of fields; you might have a small number of very wide fields, or a large number of small ones, and either way the width plays a role. The number of partitions in the stream is an important factor: two partitions versus 20 versus 200 needs an appropriately sized setup. The type of index used for each column matters: you might use an inverted index on some columns, a raw (no-dictionary) index on others, and so on. The time boundary, or rather the line that limits the warm segments, matters too: whether 100% of your queries span 30 days, or 99% of them span two days, makes a difference. If you know which days your queries will touch, you can optimize your setup to get more juice out of your hardware. Segment size is important, and so is the frequency of segment completion: complete a segment every minute and you'll have a lot of segments. Then there is the memory available on the servers (more is always better, but you can tune for what you have) and how many servers you use for the table: with n partitions and, say, two replicas, there is no point going beyond 2n servers, since each replica of a partition is consumed on exactly one server. Typically you don't have to go to that length with Pinot; in most early use cases you just start it and it works. You can start with a few hosts and tune from there, but it's useful to understand what goes on behind the scenes so you're better equipped to tune your installation.

One thing to understand: we don't want too many big segments or too few small segments; neither is healthy for Pinot. With a big segment, queries take time. Jackie will go over this in more detail in the next presentation, but queries are executed one segment at a time: a thread is allocated per segment and the entire segment is processed by that thread, so very big segments reduce your parallelism. On the other hand, too many small segments bring overhead in scheduling a segment and switching to the next one. You really want a sweet spot, and there is no universally right number. We have often toyed with 200 to 300 MB as a good number, but if that leaves you 5,000 segments to search, maybe it's better to make them bigger and search fewer segments. Either way it scales, and Pinot is flexible in that respect: you can choose these parameters to get where you want to go.
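For reference, the completed-segment size target discussed here is set per table in the streamConfigs section of the table config. A hedged sketch (these property names have varied across Pinot releases, so check the docs for your version):

  "streamConfigs": {
    "realtime.segment.flush.threshold.rows": "0",
    "realtime.segment.flush.threshold.time": "6h",
    "realtime.segment.flush.threshold.segment.size": "300M"
  }

Setting the rows threshold to 0 tells Pinot to learn the row count that lands segments near the target size, which is the learning behavior described later in this talk.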
All hot and most warm segments will be in memory, or are paged in, depending on your memory situation. If you don't have enough memory, the segments will be paged: the server experiences paging when the active working set exceeds the available memory. That is a generic truth of nature, but it's good to keep in mind when tuning Pinot performance. Paging can result in higher query latencies, because segments have to be paged in before they can be scanned for queries, and even in slower segment completion, because the segment build takes longer.

Segment completion is another thing to keep in mind. As I said, periodically consumption finishes, the segment completes, and new consumption starts, all on a per-partition basis. While a segment is completing we don't consume; we pause. Typically segment completion takes tens of seconds, so during that time consumption is paused and you'll get not-so-fresh data for 20 seconds or so, depending on the size; a lot of factors determine how long a segment takes to complete.

Quick question, sorry to interrupt. A question came up that's relevant to this and I wanted to confirm the accurate statement. Ken Krugler asked: did you mean to say "too many small segments" or "too few big segments"?

Ah, correct. So yes: too many small segments is not good, and too-big segments are not good either. Very large segments take away from your parallelism, and too many small segments give you overhead in switching, so you really want to find a sweet spot for segment size.

OK, cool, thank you for clarifying. I wanted to ask now in case people are watching this on demand later. Thank you.

So yes, thanks for your question, Ken. Segment completion does pause your consumption, and it does increase GC overhead during the completion part; after that you will have less GC, but there will be some GC while the segment is being built. A server restart causes consuming segments to discard data and restart consumption. Those of us who work in live installations know restarts are a way of life, and they can happen either with or
without control. Sometimes we want to upgrade servers and we do a rolling restart, and we don't stop to check what state the segments are in; we don't have that kind of time. A restart does cause consuming segments to discard data and restart consumption. Typically that's not a problem, but there are things you can tune if it becomes one. The takeaway: keep the working set in memory as much as possible, or be tolerant of paging, one of the two.

Now a little more detail. Most of this probably won't affect most installations, but if something pops up it's good to know what's going on. How do we allocate memory for a consuming segment? In a consuming segment the rows get indexed continuously and have to be available for query, so one might think of it as something that starts small and keeps growing. What we actually do is pre-allocate memory for N rows, depending on the cardinality Ki of each column i. It is allocated on a columnar basis, and it's all off-heap, which means it is memory-mapped for the most part. There is heap memory used for inverted indices, allocated on demand; that comes out of the Xmx heap you specify when you start the JVM, and it is essentially unmeasured; we don't know how to measure it.

There's a question from Ken: what's the best way to estimate the working-set size? Is it summing the uncompressed segment sizes of all hot and warm segments? That would give you one way, but I'll show you a tool that can help estimate it, because it's very hard to do per use case, per table.

The value of N is decided based on a predicted segment size. You choose a segment size when you configure the table, say 200 MB or 500 MB segments, and we adjust the number of rows to try to reach that size and stay as close to it as possible. The pre-allocation is based on learning the cardinality from previous completed segments: they give us some indication that, say, this column is a data-center id, likely four or five values, and this one is a user URI, likely unbounded. We learn from that, size the allocation accordingly, and add a little extra. If a segment deviates from its predecessors' properties, you do see additional allocation. We try to avoid that, because additional allocations mean that at query time we have to chain through them and walk each one as the query comes in, which is not good for high query performance; but it does happen occasionally.

The forward index and raw index memory are all on the order of N, and the value store is on the order of the cardinality. As new values come in we build a dictionary: the values are kept in the value store, and the dictionary, essentially a hash table mapping value to id, is also on the order of K, the cardinality.
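As a rough worked example of these orders of magnitude (all numbers assumed for illustration, not measured):

  N = 1,000,000 rows pre-allocated for the consuming segment
  int metric column:  forward index ≈ N x 4 B ≈ 4 MB   (order N)
  string column with learned cardinality K = 10,000 and ~16 B values:
    dictionary-id forward index ≈ N x 2 B  ≈ 2 MB      (order N)
    value store                 ≈ K x 16 B ≈ 160 KB    (order K, plus a hash table of order K)

Summing this over columns, and over the partitions consumed on a host, approximates the consuming-segment part of the working set.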
But there is a problem with the dictionary: it is actually a hash table, so depending on the kind of values that come in on the stream, you hash to different places and touch different areas of that memory. The point of going through this detail is that it is not the case that only the first part of consuming memory is read-only and only the last part is read-write; some of it is read-write throughout. And we all know that read-write memory means more dirty pages on your system, which then have to be written back, and so on.

Question from Elon: is there a config to control whether inverted indices are stored off-heap? We don't have off-heap inverted indices as yet. Good question, and we are looking for contributions in that area. We haven't worked on it for a while, but so far, at least in our installation, it has not worried us too much.

So the takeaways here. Not all consuming-segment memory is read-write; a lot of it is actually read-only, but the read-write portions are the ones that will page in and out more, and you will have dirty pages, so it's good to look at your system and see what's going on there if you face problems. The dictionary memory is unpredictable for large cardinalities: if we learned a cardinality of 10,000 from the previous segment but sometimes 8,000 values come in and sometimes 12,000, depending on the time of day, reallocation happens, and that is unfortunately a fact of life in large-scale systems. Consuming segments use some long-term heap memory, the inverted indices being one example, as Elon just noted. And the segment size and memory allocated stabilize over time: we learn from the previous segments, starting small and increasing the number of rows consumed per segment as time passes until we reach the target segment size, and then we adjust really slowly. It takes a few segment completions to get to the right size, and similarly the memory allocated takes a few completions to stabilize. Because of this, if you start a real-time table today and immediately run a performance test, it may not perform as well as if you let it consume for a couple of days first. Again, this probably matters only for really high-throughput ingestion or very high-throughput queries; if your system is working fine, you're good. It's just useful to know what's going on in the background, hence this slide.

All right. What I've given you so far is all the variables and all the unknowns. So how do you control them? We have given you some tunables. First, there is a tool called RealtimeProvisioningHelper in the pinot-admin command. You can run it and it helps you decide the segment size for your parameters, your set of systems, and your available memory. You feed it your schema, table config, ingestion rate, the amount of hot and warm data (say you expect five days of warm data), and the amount of memory your systems have, and it spits out a few segment sizes that may be good for you. You can also give it the number of hosts you plan to run, because more hosts means fewer partitions consumed per host, which also affects how you configure things. So it prints an optimal configuration for your setup, and we always recommend running it: try it with two, four, six, eight, ten hosts (these are all command-line options), and it will come back and say: with six hosts do this much, with ten hosts do this much, and with two hosts it's not going to work.
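A hedged sketch of an invocation (flag names as I recall them from the pinot-admin tool; run it with -help to confirm for your version):

  bin/pinot-admin.sh RealtimeProvisioningHelper \
      -tableConfigFile /path/to/table-config.json \
      -numPartitions 16 \
      -numHosts 2,4,6,8,10 \
      -numHours 6,12,18,24 \
      -sampleCompletedSegmentDir /path/to/a/completed/segment \
      -ingestionRate 1000 \
      -maxUsableHostMemory 48G \
      -retentionHours 72

The output is a matrix of estimated memory use and optimal segment size for each combination of hosts and retained hours.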
Recently we also added auto-generation of data to this tool. What used to happen was that the provisioning helper would take an existing completed segment, derive some characteristics from it, and give you the answers for optimal segment sizes. With auto-generation, you just give it a schema and some cardinality information, and it generates the data and computes optimal segment sizes for you.

Next, you can configure pinot.server.instance.realtime.alloc.offheap. That is the one that allocates off-heap memory for consuming segments, which is predictable: less GC when segments complete. I believe this is the default now; if not, we should probably make it the default, because heap allocation for all the dictionaries and everything is very expensive in terms of GC. Things accumulate over time, and when the segment completes there is a huge GC releasing a whole bunch of objects.

There is the consumer directory you can configure. It lets you choose the directory where the memory-mapped files for the consuming segments live. Because they are read-write, you might want to give them special attention; for example, you could use a memory-based file system, so that if a lot of dirty pages are continuously being written back, they go to the memory-based file system and you don't incur a disk penalty.

Then there is the number of parallel segment builds. If you have 20 partitions being consumed on a single machine, you may not want all of them to go through a segment build at the same time during completion; there is an optimal number you can set. By default they all build simultaneously, and you might run into problems, or you might not, if you sized your system correctly at the beginning.

And then there is the recommendation engine we added recently. It's a controller API: you go to the controller's recommender endpoint (tables/recommender), give it some input, and it comes back with possible optimal configurations for your table. I believe it also runs the provisioning helper, but I'm not 100% sure at this time.
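For reference, the server-side tunables just described live in the Pinot server configuration. A sketch (the first two keys are standard Pinot server properties; the parallel-builds key name is from memory, so verify it against the docs, and the tmpfs path is only an example):

  pinot.server.instance.realtime.alloc.offheap=true
  pinot.server.instance.consumerDir=/mnt/tmpfs/consumerDir
  pinot.server.instance.realtime.max.parallel.segment.builds=4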
For batch offline tables, is there an equivalent approach? We do have another question too, so I was just going to wait until the Q&A. Do you want to finish up first? It's up to you.

OK, why don't we wait.

In the meantime: in case your system is undersized, what are the indications? Typically you won't face this; you put up a few servers, you have a few partitions, and things just run out of the box. But in case you're choked up somewhere and you find the SLA going bad, what might you observe? One thing is the server running out of heap memory. If it runs out of heap memory, you could increase your Xmx; that will help. On the other side, you could check whether this happens during segment completion (if so, tune the number of parallel segment builds), or whether you have too many inverted indices (they are always on heap), or too many stream partitions being consumed on a single server, or too many segments committing at the same time, or high QPS. For all of these, ultimately the only remedy is to add more replicas or more servers so that the partitions get distributed across them.

Ingestion bursts may hurt query performance: with a huge burst of rows coming in, the server is too busy to serve queries. One tuning knob we have: if the burst happens during a restart, you can tell the server to wait some period of time before it starts serving queries. That has helped us at LinkedIn. We are also trying to add a smarter way of recognizing when we have caught up enough to start serving queries; that is work in progress, but the wait is the setup you can do today: the real-time consumption catch-up wait, which you can set to two minutes, three minutes, whatever you think is useful after a restart, so that queries start at the right time. You can also throttle ingestion into the topic if that is possible; aside from restart, we don't have any other way to control an ingestion burst. If it comes in, we consume it. We did try some experiments on rate limiting, but that's still experimental; we have nothing concrete today.

More indications: high query latency. Check the bottleneck: are you paging? Paging is very common if you have a lot of consuming segments. What is the ratio of hot-plus-warm memory to the total memory you have? If it's high, you're going to be paging, and you've got to do something. For all of these, adding more servers typically helps; if you start off with more servers, you're unlikely to see any of these problems. You can also consider using SSDs for both completed and consuming segments, and/or a memory-based file system as I mentioned; those reduce paging and also improve query latency. Of course, you can add more hosts. You can move completed segments to another tenant; we have that available and use it in our multi-tenant environment, isolating all the consuming segments to one tenant while the completed segments live on another tenant, characterized exactly like offline segments.
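A sketch of the catch-up wait mentioned above, as a server startup property (the key name is from memory, so treat it as an assumption and confirm in the docs):

  # wait up to ~2 minutes after a restart before serving queries
  pinot.server.starter.realtimeConsumptionCatchupWaitMs=120000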
You can add an offline pipeline to make it a hybrid table; that limits the number of hits on the real-time machines and leaves them enough room to consume rows. You can use an optimal segment size as recommended by the provisioning helper, and you can add appropriate indices, though adding an index also takes memory, so it's a double-edged sword there.

If rows are showing up late (it has really never happened, let me say that), maybe segment completion is taking too long. Look into what's going on there; the controller logs might show you something. One likely case is that the upload to the segment store is taking a long time, and because there is a pause during completion, the rows take a while to show up.

Some of the futures. We want to reduce the consumption pause during segment completion. We want to continuously optimize the segment size: right now you give us a segment size and we learn to reach it, but we really want to tune it to your table, deciding whether a bigger or smaller segment size would be better. We want rate control and a budget on ingestion, so that if extra ingestion comes in, we rate-limit consumption. We want the option to use peer download only, with no upload of real-time segments; segment completion can take less time that way, but of course there's a risk: you don't have stable storage for the segment. Continuing ingestion while completing the segment, and off-heap inverted indexes: these are all on our wish list, but we haven't yet gotten around to many of them. And multi-tenant real time is a big open problem; it's really hard. Serving even a single tenant had all the variables I described; imagine multiple tables all consuming on one machine. Some key problems there: how do you rebalance across tables, what about capacity and headroom estimation, and can we have a per-table quota for ingestion?

These are some links into the Pinot docs at Apache (docs.pinot.apache.org), and this one is a blog I wrote a while ago, "Auto-tuning Pinot real-time consumption"; it's a good read if you followed this talk. We're always looking for contributions. Thank you so much; I think I exceeded my time a little.

Oh, it's OK. This was such a well-put-together presentation; I'm really impressed. It's a lot of content and it's very dense. Thank you so much for taking the time to put this all together. I'm thinking there are definitely people who are going to be rewinding through this and pausing on the slides. If you have the URLs for the slides, the hyperlinked ones, send them over to me and I'll add them to the description of the video on YouTube so people can access them easily.

If anybody has other questions for Subbu, we're going to the Q&A now, so please ask in the chat, and I'll start feeding in the questions. Sadeem asks: is query latency going to suffer over time if my
retention period is large and the number of segments increases over time?

Yeah, that's what I would think. If you're going to query 30 days, then 45 days, then 60 days over time, you're going to scan more data to get your results, so I would imagine latency will suffer. But you can add more servers. With the same number of servers it's probably going to increase; if you add more servers, the work gets distributed and you're going to be good.

OK. Ken asked: for batch offline tables, is there an equivalent of RealtimeProvisioningHelper? If not, then circling back on the question of how best to estimate memory needs, with replication.

Yeah, good question. For offline tables the closest we have is the recommender endpoint I mentioned. The offline estimate is fairly easy to do, because for memory purposes it's the segment size: if you have your segment sizes, or your rough data size, that is your memory used for offline, and you can choose a ratio of that to your available memory to decide how much paging you want to take and how much latency you're willing to tolerate.

OK, and Elon asks: what is the best way to manage controller responsiveness during bursts of ingestion due to uploading many offline segments to the controller?

OK, the talk was more about real-time ingestion, but yes, if you have a lot of segment uploads on the controller, many things can play a part. Your deep storage could be slow, in which case the controller is holding up threads and can't release them, because the writes to the deep store are slow. It's unlikely you have so many segments that the controllers are overwhelmed, but if your deep store is fast enough, you can add more controllers; we do scale up to 20-plus controllers per cluster. If you need more than 20 controllers, something else is probably wrong with your cluster and you should split it.

I just want to mention, before we go to the next question, that we have another presentation coming up with Jackie on optimizing query performance: which index to use, filter optimization, and segment pruning. That's coming right after this Q&A, so please hang on if you're still here.

Ken's other question: how best to size the JVM broker memory needed for combining results from servers to construct the final result?

Yeah, broker memory sort of depends on your queries. If you're doing distinct queries, you do get the exact results streaming from the servers to the broker, and there's a lot of memory to be handled there; but if you're doing aggregations, you're probably OK. We don't have a formula for it, sorry; maybe Jackie has something up his sleeve.

Yeah, we'll bring him back in; he can add to it if he wants.

And: is there a way to stagger or throttle segment completion to avoid a thundering herd?

Yes, that is the configuration I showed: you can limit the number of segment builds per host. There is not a way to stagger it across hosts, because all of these go to the same lead controller for the completion, but the controller part of it is really small; the main cost is actually
building the segment, and if you can control that on a per-host basis, you're good. This worked for us.

Nice, awesome. We have some people who just commented "great presentation" and "that was pretty slick"; that's a fun comment. Subbu, this was really, really good. Thank you so much for taking the time; you can definitely tell you put a lot of time and thought into putting this presentation together, and I also appreciated the humor, because I was laughing backstage at your jokes.

Yeah, thanks Kareem for organizing it.

Oh, we do have another question that just came in, so we'll ask this one; if there are any others, we can either take them in Slack or after Jackie's presentation. Oh wait, a question from the same person: how does real-time upsert on older data affect ingestion performance?

I think it does have an effect, but I don't have a measure of it at this time, sorry.

OK, cool. That's also maybe something Jackie can answer when he comes on, if he wants to tackle a couple of the questions beforehand. Subbu, thank you so much. I know this was probably a lot to prepare for, because the content was definitely awesome and very well put together, but I definitely think you'd be highly valuable doing presentations again in the future.

Sure, yeah, for the community.

All right, so I'm going to put you backstage and bring out our next speaker, Jackie Jiang. He's a founding engineer at StarTree and an Apache Pinot PPMC member and committer. His background: he is a founding engineer at StarTree, as I just said; he worked on the LinkedIn Pinot team for four and a half years and became a PMC member and one of the top committers for Apache Pinot. Jackie's goal is to make Apache Pinot the fastest online analytics platform on the market, which I love so much. Hi Jackie.

Hi Kareem.

So, if you have any feedback on the questions that were asked, feel free to add to it, or we can just jump right into the presentation, whichever you prefer.

Yeah, I think I can add something to the last question about real-time upsert. Upsert requires all the segments of a partition to be on the same server, and the upsert metadata is maintained in memory, so basically I don't think there is extra overhead for upserting older data versus newer data. There's no obvious overhead around that.

Awesome. We do have a final comment and question: for segment flush periods, what is optimal? We might just be going into Q&A forever; if you want, we can also tackle this at the end of your Q&A.

Yeah, for this one you can use the tool Subbu described. For different use cases you might want to configure different flush periods based on your segment size, and the tool should give you a good recommendation.

Cool. For those viewing, we will do a Q&A afterwards, and we can also bring Subbu back out if you have more questions for him before we end everything; we'll bring him back out if he's ready for it. OK Jackie, I'll let you take it away when you're ready.

Cool, I'll see you at the Q&A shortly.

Sure. So yeah, feel free to interrupt if there are any questions and it
makes sense, but we can also save them for the Q&A at the end.

OK, so thanks Subbu for giving a really good presentation on how to optimize real-time performance. Subbu's talk was mainly from the data-ingestion and data-management perspective; my talk is mainly about how to optimize query performance by using the correct index, plus some special techniques to prune segments.

Subbu already gave a pretty decent overview of the Pinot architecture, so here I'll give a brief overview of the components used to execute a query: brokers, servers, and segments. First, when the broker receives a query, it calculates a routing table and routes the query to the servers. When a server gets the query, it executes it on the segments: not necessarily all segments, but the segments hit by this query. After getting the segment responses, it merges them. The broker then gathers the responses back from the servers, merges the server responses, and returns the final response to the client. That is the life cycle of a Pinot query.

Let's go over each component in this graph. The first component is the broker. The broker maintains the mapping from each server to the segments it hosts. When receiving a query, the broker is responsible for calculating the routing table: essentially, where all the segments hit by this query are located and which servers to route the query to. The broker then scatters the query to the servers hosting those segments, and the last step, after the servers respond, is to gather all the server responses, merge them, and return the final response.

Next, the server. The server hosts the segments; when it receives a query from the broker, it executes the query on each segment, merges the segment responses, and sends the server response back to the broker.

The last component is the segment. If we zoom into a segment, it wraps a block of records, and the segment is the smallest query-execution unit in Pinot. Within a segment are the column indexes: Pinot stores all data in a columnar fashion, so the indexes inside a segment are per column, not per row.

Now let's see how a query is executed on a segment, with a very simple example: a sum over column1 where column2 equals 123. Solving this query has two phases. First the filtering: we find all the records matching the predicate column2 = 123. Then the aggregation: we read the column1 values for all the matching records and aggregate them.
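Concretely, the running example for the rest of this talk looks like this (hypothetical table and column names):

  SELECT SUM(column1) FROM myTable WHERE column2 = 123

  -- phase 1, filter:    find the record ids where column2 = 123
  -- phase 2, aggregate: read column1 at those record ids and sum them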
So let's look at how to optimize the performance of this query. In query execution, the broker first gets the query and scatters it to the servers, and the servers execute it on the segments. From a performance-optimization perspective I want to go bottom-up: first how to optimize query performance on a segment, then on the server, then on the broker.

First, the segment, still using the same query as the example. To solve this filter without any special index on the column, we need to scan all the values of column2 to find the matching records; basically a full column scan, picking out the records that have column2 value 123. This can take the majority of the query time, because we scan every individual value of the column, especially when the selectivity is high, meaning there are only very few matching records. The reason: with few matching records, the aggregation phase is very cheap, since there isn't much data to process, so the majority of the query time goes to the filtering phase. So here we want to optimize the filtering.

The first index I want to introduce is the inverted index. An inverted index maintains a mapping from each value to the records matching that value, with the records stored as a bitmap of the matching record ids. To be more specific, let's look at an example: a segment with the following records, where record 0 has value 5 in column1, record 1 has value 10, and so on. Converted into an inverted index, it looks like a map from column value to the record ids with that value: for column1 value 5 there are two matching record ids, 0 and 4; for value 10 there are also two, 1 and 3; and so on. With this inverted index we can solve such filters directly. To solve column1 = 5, we look up the inverted index and find the two records matching that value; similarly, column1 = 10 or column1 = 123 is easily solved with just one value lookup. And if the query has, say, the filter column1 = 1, the lookup finds no matching values, so we know there are no matching records and we can skip the aggregation phase entirely.

The inverted index is the most commonly used index for optimizing filtering performance, and it can solve these predicates. The equals predicate, as we covered, for example column1 = 1. The not-equals predicate: we first get the bitmap for the equals predicate, then reverse it, which is to say we subtract the bitmap from the whole document-id range. The IN predicate, by unioning the bitmaps for all the values in the IN clause: for column1 IN (1, 2, 3), get the bitmaps for values 1, 2, and 3 and union them. Similarly NOT IN: first union the bitmaps for the values, then reverse the result against the whole doc range.
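Enabling the inverted index is a table-config change. A minimal sketch, using the column from the running example:

  "tableIndexConfig": {
    "invertedIndexColumns": ["column2"]
  }

New segments build the index at creation time, and existing segments can be reloaded to add it.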
bitmaps but then there might be too many matching values for example i have value six seven eight blah blah blah and then i so in order to solve it with inverted invest i might need to uh do do a union on let's see a lot of bitmaps so like merging so many bitmaps can be very costly so to solve this problem i want to introduce another index called range index so range index is designed to optimize the range predicate it maintains the mapping from value range to records matching the range so as we can see it is quite similar to the inverted index the difference is the key for this mapping is the value range instead of each individual values so let's take another example so for example here's my segment data and the to convert it into a range index it'll be something like this so the key is actually a value range so here we can see like the first entry is for column one of value range zero to ten and then from zero to ten we found two matching record records zero and four so the value is three and five for these two records and then similarly for value range 10 to 20 there are another two matching records one three and the four value larger than 20 then there's only one matching record too so let's see to solve so using range index we can easily solve this queries uh these filters like for example column one smaller than ten then we know okay and only we look up this uh range index mapping and then found okay only run zero to ten match this filter then we found okay record id 0 and 4. and then column 1 smaller than 20 similarly we found column 1 of range 0 to 10 and 10 to 20 match this filter so we do a union on this two record id set and then get the final result zero one three four and then column one larger or equal to twenty we found okay it matches the entry 20 to thirty then if okay there's only one matching record id which is record two so so far we have seen all the so all this filter match the value range boundary so how about this filter column 1 smaller than 15 so we found okay there's no like no value range that has 15 at the boundary so how do i solve this query so so basically this case is when value is not at the right boundary to solve this query we look up this similarly we look up this uh mapping and then we found a value range zero and zero to ten it's always falling into this uh predicate range which is larger than fifteen so we know records 0 and 4 always match and then value range 10 to 20 matt might match this range predicate which contains records 1 3 and then 20 to 30 won't match because this whole value range is larger than 50. 
To get the final result, we scan the records in the 10-20 range, records 1 and 3, and find that record 3 matches while record 1 doesn't. We add the matching record to the always-match record ids and get the final matching records: 0, 3, and 4.

To summarize the range index: it can be used to optimize range predicates. When the range-predicate value is at a range boundary of the index, no scan is required; when it is not, we need to scan the records of one value range, but we still avoid a full column scan. For example, with the default of 20 value ranges, we only need to scan about five percent of the data on average.

So far we have covered the range index. Let's look at another very powerful index, the sorted index. The sorted index can solve all the predicates mentioned above (equals, not-equals, IN, NOT IN, and range predicates), but it requires the values to be pre-sorted within the segment. Since it requires sorted column values, it cannot be applied to every column; in most cases it can be applied to only one column, and usually we apply it to the primary-key column. A primary key is not a concept in Pinot per se, but borrowing from other databases, you can think of it as the column you are most likely to put a filter on. For example, take this data set: the column1 values are already sorted within the segment. To convert it into a sorted index, we keep a mapping from each column value to the record-id range. Because the column is sorted, the record ids for a value always form a contiguous range, and these ranges are themselves sorted.

Now let's look at how to use the sorted index to solve queries. For equals, not-equals, IN, and NOT IN, we can use the sorted index like an inverted index, since it also maintains a mapping from each individual column value to its records, and we can easily convert a record range into a bitmap. For range predicates, because the values are already sorted, we can perform a binary search on the values and get the matching record ids. For example, to solve the predicate column1 > 3, we look up the sorted-index entries and find that the first matching value is 5; the result is every record from there on. Value 5 has the record range 2 to 2, so we pick all the records with id 2 and above, which in this case is records 2 through 4.

One thing I want to mention: the sorted index in general performs better than the inverted index and the range index because of data locality. The selected records usually sit together, so when we scan in the aggregation phase we get better performance.
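Both of these are declared in the table config as well. A minimal sketch, using the example column (for real-time tables, setting sortedColumn makes Pinot sort the rows when the consuming segment is committed, while offline segments must arrive pre-sorted):

  "tableIndexConfig": {
    "rangeIndexColumns": ["column1"],
    "sortedColumn": ["column1"]
  }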
So far we have covered the most commonly used index types. We also have other indexes that optimize some special predicates; I'll just list them here without going deeper, for time's sake. First, there is the null-value vector index, essentially an inverted index on a column recording whether each row's value is null or not null; it's used to solve the IS NULL and IS NOT NULL predicates. Then we have the Lucene text index, which solves the TEXT_MATCH predicate; the Lucene FST index for regular-expression-like predicates; the JSON index for JSON_MATCH; and the H3 geo index for geo-distance range predicates. Several of these were already covered in previous meetups, so you can look them up on our YouTube channel.

So far we have covered all the indexes that optimize the filter. Now let's look at the next phase, the aggregation, and how to optimize that. If there is no filter, or the filter has very low selectivity (meaning most records match), then the aggregation phase can be the bottleneck, because it needs to scan values from lots of records. Here we have a special index called the star-tree. The star-tree index pre-aggregates the records, which can significantly reduce the values to be scanned during aggregation. Star-tree is quite complicated and I won't cover its details in this talk, but it's so powerful that we named the company after it. If you are interested, please read more about star-tree at this link; we have a blog describing the use cases and the layout of the star-tree, and it's also covered in the Pinot docs.

OK, so that covers all the techniques we have to optimize query performance on a given segment. Now let's look at how to optimize query performance on the broker and the server. With all the indexes I've described, we can achieve very good performance executing a query on the segments, but what if we want to go one step further? For some very large use cases, say a table with a petabyte of data and 100k segments, or use cases with very high throughput requirements, say more than 10k QPS, we want to optimize further to meet the SLA and probably also reduce the hardware cost. On the broker and the server we can use techniques that reduce the number of segments to be processed for each query.

First, the query optimization we have on the server: the bloom filter. Still using the same example, to solve the predicate column2 = 123 we can configure a bloom filter on column2 to quickly filter out segments that do not contain the value 123. The bloom filter is most efficient on high-cardinality columns whose values don't repeat much across segments; transaction id is a great example. If we want to find the records for a given transaction, the query would be something like select * from table where transactionId equals some UUID. A given transaction id won't repeat across many segments; it might exist in only one or two. With a bloom filter at, say, a 3% false-positive rate (a common false-positive probability), roughly 95% of segments will be pruned in the regular case, so we only need to process about 5% of the segments. Because the bloom filter needs actual storage, bloom filters are stored on the server side, along with the segments.
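A minimal table-config sketch for this, using the column from the example (a per-column false-positive rate can also be set via bloomFilterConfigs):

  "tableIndexConfig": {
    "bloomFilterColumns": ["column2"]
  }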
Next, let's take a look at a query performance optimization technique on the broker: partitioning. Let's use the same example. We can partition the data on column 2, so that each segment only contains column 2 values from the same partition. With the data partitioned this way, the predicate "column 2 equals 123" will only hit the partition that contains the value 123, so we only need to pick the segments from that partition to solve the query -- we know the segments from other partitions can never contain the value 123. The partition info is stored in ZooKeeper along with the segment ZK metadata, so it is accessible to all the brokers; a broker can prune segments based on the partition info and route the query only to the matching segments. It doesn't even need to send the query for segments that are not in the matching partition.

Let's take a look at the different partitioning mechanisms in Pinot. First, there is time partitioning. By nature, time-series data is usually partitioned by time: for example, if we push a segment each day, then the data will be partitioned by day. Queries like "SELECT ... WHERE eventTime BETWEEN August 1st AND August 5th" are very common, and for a hybrid table -- the integration of an offline part and a real-time part -- there is always a time filter attached to the query, so time partitioning helps with that use case as well. For this query, the segments for all other dates are pruned, because they can never match the time predicate.

Another type of partitioning is value partitioning: Pinot also supports partitioning based on the value of a column. The most commonly used partition function is Murmur2, which is the default partition function in Kafka. By using the Murmur2 partition function, we automatically get partitioning info aligned with the Kafka consumer, and the real-time segments inherit that partitioning. One example for value partitioning is solving a predicate like "memberId = 1234567": if we partition on memberId, we only need to pick the segments from the partition that contains that member ID value and process only those segments. If we have 20 partitions, only the segments from one partition need to be processed -- 5 percent of the data -- and the other 95 percent of the data is pruned.

To further optimize query performance for partitioned data, we can configure Pinot to assign the segments of the same partition to the same server. By doing this we reduce the fan-out from the broker to the servers: if a query only hits one partition, it only needs to query the segments of that single partition, and therefore only one server. This topic was already covered in a previous meetup on segment assignment, so feel free to check it out on our YouTube channel.
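As a minimal sketch (memberId and the partition count are just the example's values), the partitioning is declared in the table config, and the partition-based pruner is enabled in the routing config:

```json
{
  "tableIndexConfig": {
    "segmentPartitionConfig": {
      "columnPartitionMap": {
        "memberId": {
          "functionName": "Murmur",
          "numPartitions": 20
        }
      }
    }
  },
  "routing": {
    "segmentPrunerTypes": ["partition"]
  }
}
```

Note that the function name here is Pinot's "Murmur" partition function, which is meant to line up with Kafka's default Murmur2-based partitioner as described above.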
So far we have covered all the techniques to optimize query performance. We also have some indicators available in the query response: if you send a query from the Pinot query console, you will get a response with the query response stats on top, including the time used and some other metadata, and if you choose to show the JSON response, you will see all of this metadata in the JSON. These query response stats are very good indicators of opportunities for query optimization, so I'll go over some of the important ones here.

First, there is a stat called numEntriesScannedInFilter. It counts the number of values scanned in the filtering phase while solving the query. If we find this value very large, we probably want to add some index -- for example an inverted index or a range index -- to reduce the scanning during the filtering phase. There is another stat with a quite similar name, numEntriesScannedPostFilter: it counts the number of values scanned after the filtering phase, for example the values scanned for selection or aggregation. If we find this value very large, we might want to add a star-tree index and do some pre-aggregation on the records to reduce the values scanned.

Another set of indicators is numSegmentsQueried, numSegmentsProcessed, and numSegmentsMatched. What do these three values mean? numSegmentsQueried is the number of segments the servers get from the broker to process -- enabling partitioning can reduce this value. numSegmentsProcessed is the actual number of segments processed by the server after server-side pruning -- adding bloom filters can reduce this value. The last one, numSegmentsMatched, is the actual number of segments that have records matching the filter; if this value is much smaller than numSegmentsProcessed, it means we still have an opportunity to improve the server-side segment pruning so that segments are pruned without being processed.

In the end we have two more stats about servers. The first one is numServersQueried: the number of servers queried by the broker for this query -- basically the fan-out of the query. Enabling partitioning, together with partition-based segment assignment, can reduce this value. The other one is numServersResponded. If this value does not match numServersQueried, it means some servers didn't respond -- for example some servers timed out -- and the result is not complete; the user can choose to retry the query or accept the partial response, depending on the use case.
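For reference, these stats appear as top-level fields in the broker's JSON response metadata. A trimmed response might look like the following -- the numbers are purely illustrative:

```json
{
  "timeUsedMs": 37,
  "numDocsScanned": 1200,
  "numEntriesScannedInFilter": 0,
  "numEntriesScannedPostFilter": 2400,
  "numSegmentsQueried": 64,
  "numSegmentsProcessed": 4,
  "numSegmentsMatched": 3,
  "numServersQueried": 2,
  "numServersResponded": 2,
  "totalDocs": 1000000
}
```

In this hypothetical reading, partitioning already cut the fan-out to 2 servers, bloom filters pruned 60 of the 64 queried segments, and the gap between numSegmentsProcessed and numSegmentsMatched (4 versus 3) is small, so there is little pruning left on the table.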
Yeah, so that's all for my presentation. I didn't cover how to configure the indexes, so you can check that out in the Pinot documentation.

Okay, so let's do this, Jackie -- I can't believe how often you do this. This is crazy; this is so much information, and I'm really impressed that week after week you're here doing this. It's awesome -- it means a lot that you're taking the time to help people figure out all the stuff that you've been working on.

No, I'm saying, since everyone who works on this project builds so many awesome features, I have something to talk about.

That's great. All right, so I'll bring Subbu back out and start bringing in some of the questions, because there have been some questions and conversations. Hi Subbu, welcome back. Okay, so I'll start here, with Sadim: if the column type is not int or long -- if it's string -- then will query performance not improve as much?

In general, string has slightly worse performance compared to int or long, but if the values are dictionary-encoded, then usually the overhead is just the dictionary lookup; after that we use the dictionary ID, so the performance will be the same. So the short answer is: as long as the values are dictionary-encoded, you won't see much worse performance.

Subbu, did you want to add to that? I see you nodding. Oh, your volume -- I think your headphones, your sound is off. Okay, there we go.

Yeah, the other part is the no-dictionary option. If your cardinality is generally small, then you can use a dictionary and that's fine; but if your cardinality is huge and you're not going to search on the column, then you can use no-dictionary as well, and that way we only pull out the rows that finally pass the filters in the query.

Cool. Okay: how performant is the inverted index when the column cardinality is high, or unique? Mayank chimed in here, so I'm just going to show his response: setting an inverted index on a high-cardinality column typically helps, as it can prune out larger numbers of records from scanning. Do you guys want to add anything to that? -- Nope, he said it. -- The only thing you might want to consider is the index size, but performance is not the problem in this case; it's mostly about index size, because for each value entry there is a bitmap. We use roaring bitmaps, though, which compress fairly well, so in most cases I would say you don't need to worry about that.

Okay, so here's another question with a little bit of a conversation, so I'm just going to put it all on the screen: for the sorted index, can we use multiple columns? Ken responded: I think only one column for sorted, since the entire segment has to be sorted by that field, and I assume not a multi-value field. And then Elon, to clarify: I meant something like a composite key -- is that possible, for example (timestamp, version)? Or is this not done for performance reasons? Jackie, do you want to take that?

I'm trying to understand this -- are you asking about sorting on a composite key? Elon can probably clarify; we can jump to the next question and have him elaborate. But I can add a little bit on the sorted column. The sorted column is automatically figured out by Pinot, so you don't need to explicitly configure it. When Pinot generates a segment, it will automatically figure out whether all the values of a column are sorted, and if so, it will automatically generate the sorted index for that column. For example, if you have a column that only has one single value, then it is always sorted, so in that case you can have multiple sorted columns. Basically, Pinot automatically figures out which columns are sorted and applies the sorted index. For the composite key, I think I might need to understand the question better.

I think he's asking whether Pinot supports first-order, second-order, third-order sorting using composite keys, and the answer is: right now we don't. But I've seen discussions about this, so who knows when it's going to come. And going back to the sorted column: what Jackie said is correct for the offline segments. For real-time, you can specify -- and we encourage you to specify -- one sorted column, because the rows don't arrive in sorted order in general. So we ask that you specify the one column that you want to sort on, which is the "primary key" that you're going to search on most of the time, and then we sort on that basis while building the segment.
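As a small sketch of that recommendation (the column name here is hypothetical), the sorted column for a real-time table is declared explicitly in the table config:

```json
{
  "tableIndexConfig": {
    "sortedColumn": ["memberId"]
  }
}
```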
Yes, thanks Subbu for adding that. So actually the sorted index does need to be configured explicitly for real-time segments. During consumption the data won't be sorted, so we use an inverted index; after the consumption is done, when the segment is committed, we reorder the records based on the sorted column, so the records end up sorted on that column.

Got it. Okay: can applying too many indexes, or providing too many columns in the star-tree, lead to an out-of-memory error? How is the memory usage handled?

Wow, good question -- based on this question I can tell you already have a pretty good understanding of the star-tree. The short answer is yes: not from applying too many indexes, but if you add too many dimensions to the star-tree, and these dimensions all have high cardinality with few records sharing common dimension combinations, then first of all there isn't much pre-aggregation happening, and second, yes, you might run out of memory, because we are essentially trying to pre-cube a huge number of combinations. Even if it works, I doubt it would give good performance. By default, if you don't explicitly configure the star-tree, we only pick the dimensions whose cardinality is not too high, to prevent running out of memory.

Okay, another question: does creating the star-tree index on a high-cardinality column -- maybe a unique one -- cause any issues, like longer segment creation time or anything else?

Yeah, I just mentioned the case of creating a star-tree on a very high-cardinality column. Actually, if you have a high-cardinality column and you want to filter on it, usually an inverted index is good enough, because you already get pretty good selectivity and you won't have a lot of records that need to be pre-aggregated.
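Since the star-tree keeps coming up: it is configured by listing the dimensions to split on and the aggregations to pre-compute. A minimal sketch might look like the following -- the dimension and metric names are made up for illustration, and maxLeafRecords bounds how many records may remain unaggregated under a leaf node:

```json
{
  "tableIndexConfig": {
    "starTreeIndexConfigs": [
      {
        "dimensionsSplitOrder": ["country", "browser"],
        "functionColumnPairs": ["SUM__impressions"],
        "maxLeafRecords": 10000
      }
    ]
  }
}
```

Keeping high-cardinality columns out of dimensionsSplitOrder is exactly how you avoid the blow-up discussed above.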
Elon said: great presentation. -- This is amazing. Okay, another question: can multiple indexing techniques be applied to one column? And how does the cardinality of a column affect indexing and querying?

The answer is yes: you can apply all of these indexes to the same column, and based on the predicate, Pinot will choose the most efficient index to solve the query. As for how cardinality affects indexing and querying: depending on the index type, it may or may not affect query performance and index size. For the inverted index, higher cardinality usually means a larger index size, but it won't affect query performance much; for the range index, cardinality doesn't matter, because it stores value ranges. So the answer depends on the type of index.

We might have to put some kind of restriction on the questions, because they are flooding in -- there are a lot. So we'll let one more question come in, and everybody else, if you have other questions, you can ask Jackie and Subbu on the Apache Pinot Slack; they are very active there. Also, if you loved the presentation, please give it a thumbs up on YouTube, and you can thank them publicly on Twitter or in the Slack or wherever you'd like. Okay -- I'm not even going to read this one in full: is this good to use in a query, or is it better to just not use regex expressions?

It could be an expensive query, but depending on your use case, I wouldn't expect regex expressions to cause problems, especially if you have the Lucene FST index configured. Without any index, to solve a regex expression you basically need to scan the values of that column and match the regex value by value -- but we process that predicate at the end, so if you have inverted indexes on the other columns in the filter, you can still get pretty good performance. So I wouldn't say don't use it; we provided it for you to use.

Okay: can offline segments benefit from manually partitioning them, similar to real-time? -- We do have partitioning allowed for both. In fact, you can set your partition function and partition the offline segments so that the pruning is done at the broker level, and only the segments that fall into the partition are considered for the query. We use that in multiple use cases at LinkedIn.

Cool. This might be a somewhat specific question: what are the recommended indexes for -- I'm not going to read this out, because it would sound ridiculous if I tried. -- Depends on whose recommendations you like, Jackie's or mine. Just kidding. -- No, I would be interested in hearing that, hearing the opposing views here. Actually, that's not a bad idea for one of the future meetups: we tackle some problems live on air and see the responses to them. That could be fun.

So actually I have this example in the time-pruning part. Usually, if you do daily segment pushes and each day's segment contains the same event time, then the time filter can already be handled by time pruning, so you will only hit the segments falling in that value range. After that, if you find there are too many values scanned to solve the two SUM aggregations, you might want to consider adding a star-tree. By the way, Subbu, if you have anything to add, please just add it if I'm jumping too quickly between questions.

Can the star-tree perform range filtering without a range index or sorted index? -- I'm not 100 percent sure; conceptually it can. The reason I hesitate is the range filtering itself. Basically, in the star-tree, each value node is structured similarly to an inverted index -- actually, yes, I think it can perform range filtering without a range index or a sorted index. The star-tree index is a special index: it wraps other types of indexes inside. Basically, you can think of the star-tree index as combining an inverted index and a sorted index within, because we reorder the data within the star-tree, so you don't need an additional index in order to use the star-tree index.

Okay, we're getting towards the end -- I think two more are left. These are good questions, so I'm just rolling with it. According to the docs, the raw value forward index is not doing any particular indexing -- that means no index. Can you please talk about it? What exactly does it do?

Sure. This is what I mentioned a while ago about no-dictionary columns -- it's also called the raw index. It means that you're not going to have filters on those columns, and therefore we don't have to look them up during selection; you're only going to project those columns at some point, while the selection of rows is done by the other filters you have. So yes, it basically means the raw values are stored in the order of the rows.
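As a quick sketch (the column name is a placeholder), a column is made raw, i.e. no-dictionary, in the table config; this skips dictionary creation entirely, which suits high-cardinality columns that are only projected, never filtered on:

```json
{
  "tableIndexConfig": {
    "noDictionaryColumns": ["logMessage"]
  }
}
```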
Okay, cool. Thank you, Jackie and Subbu. Oh my god, there are more questions coming in -- I don't know when to cut this off. If this is too much, we'll just... -- Yeah, I need to leave for my next meeting soon. -- Okay, so we'll take this one and then Elon's clarifying question, and then we'll have to cut it off, unfortunately -- unless Jackie wants to stay on and answer questions for the next three hours. This became like an "optimizing query performance in Pinot" Q&A; I think the Q&A is almost longer than the presentation. Does the format of the primary time column have an impact on query latency? I mean, if it is a timestamp stored as a long value, will that be better, or can we use a date format like dd-MM-yyyy?

What matters more is the granularity of the time. For example, if your time is stored at daily granularity, then you get a much lower cardinality for the time column; if you store it in milliseconds, the cardinality within each segment could be very high. So it doesn't depend on the actual format, but on the granularity of the time.

Okay, and I think Elon kind of addressed this too -- the version column for the sorted index, similar to the composite key question from before. And then this is the last question, so we'll just ask it because we're here; this is the final question and then we're going to wrap up. How do the null values in a column affect the indexing on that column?

That's a good question too. Internally, we put a default value in place of all the null values. If you want to explicitly query for null values, you can enable the null value vector, which was covered in my slides. Other than that, null values are stored as a default value, which you can configure in the schema -- for example, for metrics people usually put 0 as the null value, so it won't affect the most common aggregations like SUM. From the index-generation perspective, a null is just treated as that default value.

Nice, awesome. Thank you guys so much. We're going to end this now, because otherwise I think the Q&A would just keep going on forever -- there are a million questions we could ask. But for those watching, the discussion does not have to end here: feel free to join the Apache Pinot Slack and ping Jackie and Subbu. You can just ask in the general channel or the troubleshooting channel, and there are a bunch of people there who can probably help out too -- our community and contributors are pretty active. So thank you guys so much for taking the time to put together these awesome, informative presentations -- really good stuff. And Subbu, we have to have you on again when you're ready. I know it's a lot of work to do this, and we went 40 minutes over time, so I appreciate you guys putting the energy and effort into this; it really means a lot. All right, until next time -- don't forget to comment, like, and subscribe, give it a thumbs up if you liked this presentation, and we'll see you next time!
Info
Channel: StarTree
Views: 245
Rating: 5 out of 5
Id: nf11Q97-d8k
Length: 98min 46sec (5926 seconds)
Published: Tue Aug 17 2021