Apache Pinot Features: Derived Column & Realtime To Offline Job

Captions
Karin: Hey everyone, thank you all for joining. Today we have two great presentations: derived columns (and user-defined functions), and Pinot-managed real-time to offline flows. I'm Karin, I oversee community at StarTree, and we have two amazing speakers today, Neha and Jackie. We'll start with Jackie's presentation, do a Q&A for roughly five minutes, then jump over to Neha's presentation and do another Q&A after that. If you're watching live, feel free to ask questions directly in the chat on YouTube. If you're watching this on demand afterwards, you're welcome to join the Apache Pinot Slack, which I'll post in the chat right now; there you can ping Neha and Jackie directly.

A quick introduction for Jackie: if you haven't met Jackie yet, you probably haven't been to any of our meetups, because he's been presenting pretty much every week. Jackie is a founding engineer at StarTree. Before that he worked on the LinkedIn Pinot team for four and a half years and became a PMC member and one of the top committers of Apache Pinot. His goal is to make Apache Pinot the fastest online analytics platform on the market. All right Jackie, take it away, and we'll see you at the Q&A after your presentation.

Jackie: Thanks, Karin. Hi everyone, I'm Jackie, and today I'm going to talk about derived columns in Apache Pinot.

Before getting into the derived column concept, let's first look at a use case example. Assume I have an ads clicks table that tracks the views and clicks for ads. The table contains four columns: the ads ID, the number of views, the number of clicks, and the timestamp at which they were recorded. The timestamp has seconds granularity; we simply drop the milliseconds part.

Now let's look at some queries. Say I want to know the number of clicks per hour on 2021-07-27, which is basically today. The challenge is that I don't have hourly granularity on my timestamp, but I want clicks per hour. What I can do is select the sum of clicks and use the toEpochDays function on the timestamp in the filter to convert it from seconds granularity to a day bucket, and similarly use toEpochHours in the GROUP BY to bucket the timestamp into hours, so that I get the total number of clicks per hour.

Let's take another example: I want to know the average conversion rate since July. The conversion rate is clicks divided by views; for example, if ad ID 1 at the July 27th timestamp got five views and three clicks, the conversion rate is 60%. To get the average conversion rate, I need to calculate clicks divided by views and then do an average aggregation on top of it.
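As a rough sketch, the two queries Jackie describes might look something like this. The table and column names (ads_clicks, ts, views, clicks) are illustrative, and this assumes ts holds an epoch-milliseconds value that Pinot's toEpochDays/toEpochHours transforms consume; the talk uses seconds granularity, so the literals would change accordingly:

    -- total clicks per hour on 2021-07-27 (18835 = 2021-07-27 in days since epoch)
    SELECT toEpochHours(ts) AS hourBucket, SUM(clicks) AS totalClicks
    FROM ads_clicks
    WHERE toEpochDays(ts) = 18835
    GROUP BY toEpochHours(ts)

    -- average conversion rate since 2021-07-01 (epoch millis)
    SELECT AVG(clicks / views) AS avgConversionRate
    FROM ads_clicks
    WHERE ts >= 1625097600000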
As you can see, in these two query examples I need to use transform functions. What is a transform function? For each record, it calculates a value based on the values of some columns. In the queries above, the three transformation results don't exist as columns; they are computed results. All three are built-in transform functions, but besides the built-ins, Pinot also supports user-defined functions (UDFs): currently Groovy scripts and annotated Java methods can be used as UDFs.

As the examples show, these queries aren't possible without transform functions, so transform functions add great flexibility to Pinot. But they come at the cost of worse performance compared to filtering or aggregating directly on columns. For example, for the first query, in order to evaluate the filter "day-bucketed timestamp equals July 27th", I have to scan the timestamp in every record, apply the transform, and check whether the result matches July 27th. Another problem is that, because the transformed result does not pre-exist in my data set, I'm not able to add any indexes on it.

So how can I optimize this for better performance? One idea is to pre-materialize the transformations, so the transformed results are already pre-calculated and can be used directly. The first approach would be to pre-process the data before ingesting it into Pinot: run a Hadoop or Spark job that, for example, pre-buckets the timestamp into hours and days and writes them as separate columns. But running a Hadoop or Spark job takes extra resources, and I would have to maintain extra workflows, which is undesirable; it's too much effort.

Luckily, Pinot supports a feature called ingestion transforms. Instead of calculating the transform functions at query time, Pinot pre-materializes the transformation results into columns during data ingestion, without a separate data pre-processing job. Configuring ingestion transforms is quite straightforward: all we need to do is put transform configs inside the ingestion config in the table config. For example, here we add two separate columns, one with the timestamp bucketed into days and another with the timestamp bucketed into hours. With ingestion transforms, these pre-materialized columns are physically stored within the segment, so they behave exactly the same as regular columns, and because of that, indexes can also be applied to them: inverted index, sorted index, range index, bloom filter, and the star-tree index.
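A minimal sketch of what that section of the table config might look like, reusing the illustrative ts column from before and hypothetical derived column names daysSinceEpoch and hoursSinceEpoch (the new columns also need to be declared in the schema):

    "ingestionConfig": {
      "transformConfigs": [
        { "columnName": "daysSinceEpoch",  "transformFunction": "toEpochDays(ts)" },
        { "columnName": "hoursSinceEpoch", "transformFunction": "toEpochHours(ts)" }
      ]
    }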
That solves the problem going forward, but what if I already have a data set with plenty of segments in the table? I don't want to re-bootstrap the table just to get these pre-materialized columns. This is where the derived column shines.

The concept of a derived column is very simple: it's a column whose values are calculated from other columns. For example, the three transform functions in the queries above can all be modeled as derived columns. What's special about a derived column is that it can be generated on the fly when segments are loaded on the server. Say I already have a lot of segments loaded on the server side and I don't have the derived column yet. To generate it, I only need to configure the derived column the same way as the ingestion transforms, update the index config within the table config, and add the derived columns to the schema. Then I simply trigger a table reload, which makes the controller send a message to all the Pinot servers asking them to reload all the segments for the table. During the segment load, the derived columns are automatically generated, and after the reload is done they are ready to be queried, without downtime. The whole process is transparent to the user; you won't even notice that the table is in a reloading state.

Let's take another look at the two examples from before. With derived columns, to get the number of clicks per hour on 2021-07-27, the query looks exactly the same as querying regular columns: we can directly use the day-bucketed and hour-bucketed timestamp columns. Similarly, for the average conversion rate, I can directly do an average on the conversion-rate column.

Furthermore, derived columns can be paired with on-the-fly index creation. The following indexes support on-the-fly generation: inverted index, range index, bloom filter, and, most importantly, the star-tree index. I want to mention the star-tree index separately because with it you can do pre-aggregation on top of the derived column. For example, for the number of clicks grouped per hourly bucket, we can apply a star-tree index on the hourly timestamp so that the total number of clicks is pre-aggregated. To read more about the star-tree index, which is a very powerful index exclusive to Pinot, please go to our documentation for the details.

With on-the-fly derived column generation and index creation, optimizing a slow query becomes very easy: first add derived columns, then apply indexes. Both steps happen by updating the table config; after updating it, simply trigger a table reload, then sit tight and observe the performance boost, without downtime. The derived column feature has been supported since Pinot release 0.7.1. That's all for my talk. Any questions?
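As a hedged sketch of the star-tree idea Jackie calls out, a star-tree index that pre-aggregates clicks per hourly bucket might be configured roughly like this in the table's index config (the column names are the illustrative ones from above, and the values are placeholders, not recommendations):

    "tableIndexConfig": {
      "starTreeIndexConfigs": [
        {
          "dimensionsSplitOrder": ["adsId", "hoursSinceEpoch"],
          "skipStarNodeCreationForDimensions": [],
          "functionColumnPairs": ["SUM__clicks"],
          "maxLeafRecords": 10000
        }
      ]
    }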
Karin: Thanks, Jackie. We already have some questions, and if any of the attendees watching have questions for Jackie, please ask now; there's a slight delay on the live stream. One question came in while you were presenting. Sharma asked: in practice, what is the trade-off between derived columns and pre-materialization, especially the overhead of index generation and latency?

Jackie: Good question. The overhead of the derived column is on the segment reload happening on the server side, whereas pre-materialization means we already pay that price at segment creation time, so there's no overhead on the server. In reality, the overhead is pretty low, and we haven't run into any problems so far with derived column generation on the server side. You can also think of it this way: it's not possible to do pre-materialization on existing segments, so for those the only option is the derived column, and the latency overhead is not very noticeable.

Karin: Thank you. If we have any other questions, please ask now; we can also bring Jackie back at the end if people have additional questions after Neha's presentation. Oh, we have another question: excellent feature — does it work on real-time consuming segments, especially if my ingestion is done via a plug-in?

Jackie: Good question. This is similar to the feature we call schema evolution. A derived column is also a newly added column, and newly added columns won't be directly reflected in the consuming segment. To get the derived column generated, we need to wait until the consuming segment is committed; once it's committed, the committed segment will have the derived column generated. The real-time consuming segment actually supports ingestion transforms, so the next consuming segment will directly do the transformation during record ingestion. Basically, you just need to wait for the current consuming segments to be committed.

Karin: We have a couple more questions. "Thanks for the great talk, Jackie. After I add derived columns and trigger the table reload, can I start to query those new columns immediately, or do I have to wait for the reload to finish?"

Jackie: Good question. You will have to wait for the reload to finish, because before the reload is finished the column is not physically generated, so the server will drop that segment from the query because the column doesn't exist there yet.

Karin: Got it, cool, thank you. These are some really good questions. Jackie will also be available after the next Q&A, and you can join the Apache Pinot Slack, where Jackie and Neha are both present; if you watch this after the fact and still have questions burning inside of you, feel free to ask there. Thank you, Jackie, for presenting week after week on all the cool things you've been working on.

So I'm going to put Jackie backstage and bring Neha out. A quick introduction before I do: Neha is a founding engineer at StarTree. Prior to her current role, she worked at LinkedIn as a senior software engineer in the data analytics infrastructure organization. Neha is an Apache Pinot PMC member and committer and has made numerous impactful contributions to the project. She actively fosters the growing Apache Pinot community and loves to evangelize Apache Pinot in the form of blogs, video tutorials, and talks at meetups and conferences.
You can also find her on Twitter at nehapawar18. So Neha, welcome.

Neha: Hi! Thanks for the great intro, by the way. Welcome to my talk, everyone. This is Pinot-managed real-time to offline flows, also sometimes known as the real-time to offline job; you can also think of it as how to offload your real-time table operations to Pinot so that you can go take a vacation.

Before we jump into what this feature is and how it works, I'm going to spend some time on why we built it. To understand that, we'll look at the differences between a real-time table and an offline table, and see how those differences create complexity and overhead in the management and operation of a real-time table.

A quick recap of the Pinot architecture, which will be familiar if you've been attending these meetups. The first component is the Pinot servers: they hold the Pinot data in the form of Pinot segments and serve queries on the data they have. Then we have the Pinot brokers, which accept queries from users, forward them to the servers, merge the results they get back, and send them to the callers. Finally, we have the Pinot controller, which manages all the components of the cluster with the help of Helix for cluster management and ZooKeeper as the metadata store.

We see two types of servers here, offline and real-time, so let's double-click into what each of them is responsible for, starting with offline. Typically, if you have data at rest in some file system, say Hadoop, S3, or somewhere local, you would create an offline table, and the servers used in this case are called offline servers. To get this data into your offline table, you typically write a batch ingestion job using Spark or MapReduce and process the data one bucket at a time, where a bucket is usually an hour's or a day's worth of data. The batch ingestion job generates a Pinot segment and pushes it into the segment store, and then the offline server pretty much just has to download the segment from the segment store and start serving queries. So the responsibilities of an offline server are: download a ready-made Pinot segment and serve queries. That's it.

Because we use these ingestion jobs and process the data one time window at a time, this naturally helps us create Pinot segments that are aligned to the natural time boundaries. Why is this an advantage? Because certain operations rely on knowing this time boundary, or on segments being aligned to it. For example, backfill: say you had data for three days, the 14th, 15th, and 16th, and you ran three ingestion jobs, each generating one segment, named 14, 15, and 16.
Now if you wanted to backfill just the data for the 15th, and you had that data lying around somewhere, you could create a new Pinot segment and simply replace segment 15 with the new one. Also, because we have these ingestion jobs, they give us a chance to perform other data processing operations, such as rollups for a particular time range, deduplicating all the events within a day, or rounding the time column to a coarser granularity. These offline jobs give you a chance to do all of that.

Now let's look at real-time tables. If you have data in a real-time data stream, you create a real-time table, and the servers in this case are called real-time servers. Real-time servers have it much harder than offline servers; they don't have it that easy. They ingest events directly from the stream and index those events as they consume them, keeping the indexed data in memory. They also serve queries on this in-memory data. Periodically, all the in-memory data gets converted into a Pinot segment on the real-time server, and the server has to push this segment to the segment store while retaining a copy locally to continue serving queries on that data. As you can see, there is a whole bunch of additional responsibilities, and they add memory overhead on the real-time servers, especially keeping indexed data in memory and building the Pinot segment from it. And the servers are typically doing these activities for several partitions at a time. Consider an example where the data stream has four partitions: the consuming segments and the completed segments end up spread across the whole pool of available servers. Because of these additional responsibilities and all this memory overhead, the capacity provisioning, tuning, and management of real-time tables tends to be relatively complex compared to an offline table.

Also, in real-time tables it is not possible to create segments aligned to the natural time boundaries the way we could in the offline table. Let's look at why. Bring back our stream with four partitions, and say this represents the data in the partitions: going from right to left we have data starting from the 14th of July, then the 15th and 16th, all the way up to the 17th, which we'll assume is the present day. Now suppose a real-time server starts consuming the events of the pink partition. How does the real-time table know when to stop consuming and build the Pinot segment to flush to disk? This segment completion is triggered by thresholds, such as the number of rows consumed or the number of hours spent consuming, so it's not really possible to create segments aligned to the natural time boundaries. Typically, the segments generated in your partitions will span across the time boundaries and have arbitrary start and end times. By arbitrary I just mean you won't typically have a segment that starts at midnight on the 14th and ends at midnight on the 15th, or segments that are exactly one per hour.
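For reference, those completion thresholds live in the real-time table's stream configs. A partial, hedged sketch follows, with an illustrative Kafka topic name and example values; the exact threshold keys vary a bit across Pinot versions:

    "streamConfigs": {
      "streamType": "kafka",
      "stream.kafka.topic.name": "ads_events",
      "stream.kafka.consumer.type": "lowlevel",
      "realtime.segment.flush.threshold.rows": "5000000",
      "realtime.segment.flush.threshold.time": "6h"
    }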
The same thing happens in all the partitions, and another thing to note is that the segment boundaries don't even match across the partitions; it's essentially random. Why is this a problem? Because, as we saw earlier, certain operations rely on having segments aligned to time boundaries. So in real-time tables it is not easy to backfill a specific time range, do rollups for a particular time range, or dedupe events across, say, an hour or a day. Also, regarding those segment completion thresholds (number of hours, number of rows): if you don't get that tuning right, you can end up with problems like creating too many segments or very small segments, and if you have many partitions in your stream you will again end up with a lot of segments. These problems directly impact query performance in the cluster.

To summarize: real-time tables, and real-time servers in particular, do a lot more than offline servers, so the management, capacity provisioning, and tuning of real-time tables tends to be relatively complex compared to offline tables, and certain operations like backfills, rollups, and dedupes are not easily possible in real-time tables.

So what does this mean for you and your real-time table? What if you have a long-retention, real-time-only table, or you need the ability to backfill a particular time range, or you want to be able to do rollups and dedupes as your data gets older? Typically in such a scenario we would recommend that you set up a hybrid table. In a hybrid table you create a real-time table and an offline table with the same name. Your real-time table has very low retention, say a few days, two or five, and your offline table can have as much retention as you want. Your real-time table continues to consume, with ingestion set up just like before, but you have some additional steps: you set up some ETL to get all your real-time events onto a remote file system like S3, Azure, or Hadoop, and then you set up ingestion jobs to periodically read data from the remote FS, create the Pinot segments, and upload them to your offline table.
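As a rough illustration of the retention split Neha describes (table names and values are just examples), the two tables in a hybrid setup share a name and differ mainly in their segments config:

    Real-time table config (e.g. clicks_REALTIME):
      "segmentsConfig": { "retentionTimeUnit": "DAYS", "retentionTimeValue": "5" }

    Offline table config (e.g. clicks_OFFLINE):
      "segmentsConfig": { "retentionTimeUnit": "DAYS", "retentionTimeValue": "730" }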
And while hybrid tables work great, there is one major issue with this approach: it is a lot of additional steps for someone who just wanted to ingest events from a real-time stream. A user might say, "All I wanted to do was ingest events from the real-time stream into a real-time table, and now you're telling me I have to write ingestion jobs and manage those ingestion flows? What if I don't want to do all this? I just want the simplicity of consuming from a real-time stream."

This is where Pinot-managed real-time to offline flows come into the picture. As the name suggests, data is moved from the real-time table to the offline table automatically. You don't have to do anything; you don't have to write your own ingestion jobs and maintain them, Pinot does it for you. This movement of data happens one run at a time, where each run covers one time window, which simulates how an ingestion flow would have done it. While moving data from real-time to offline, the segments also get aligned to the time boundary, which eliminates most of the problems we talked about that arise from misaligned segments. It also gives us a chance to apply any processing needed as the data moves to offline: rollups, dedupes, time rounding, and so on.

Now let's dive into how this works exactly. Bring back our data stream from before, with the four partitions and data from the 14th, 15th, and 16th up to the present day. We also see the Pinot segments that have been created for each partition and, towards the end, the consuming segments for each partition. So the Pinot table is actively ingesting data from the stream and has already created some segments in the real-time table.

In the previous slide we said that Pinot-managed flows move the data from the real-time table to the offline table one window at a time, one run at a time. So our first step is to calculate the time window. To do that, each time we try to find the earliest day that has not yet been moved from the real-time table to the offline table. In this cold-start case, that earliest window is the 14th of July, so our first available time window is the start of the 14th to the end of the 14th. The next step is to select all the segments that overlap with this time window; in this example that's the four segments marked here. We can see that these segments spill outside the desired time window, but that's okay, we take care of that in the next step. In the next step we take all the selected segments and do data processing and segment generation: we filter out the portions of the data that don't fall in our time window, and with the remaining data we create a new segment. That segment now has data only for the 14th, and it gets uploaded to the offline table.

Let's look at the second run. The next available time window is the 15th of July. Again we select all the segments that overlap with our time window, the ones marked on this slide, and finally we process those segments: filter out the portions we don't care about, create the new segment for the 15th of July, and upload it to the offline table.

To do this processing of the segments, filtering and so on, we use a segment processor framework that is also part of Pinot. This is a very generic and flexible framework that lets you convert m segments into n segments. It also gives you a chance to apply some processing to your data, for example partitioning, filtering, and sorting: if you had set partitioning, filtering, or sorting on your real-time table, the framework honors those same configs while creating segments for your offline table. It also gives us a chance to do the aggregations and rollups and all those operations we said we couldn't previously do in the real-time table: you can dedupe, convert your time column to a coarser granularity, or roll up and merge rows. And finally, when it builds the segments, it uses the same indexing config that you applied to the real-time table, so your offline table segments will have the same indexes generated.
Now, just for fun, let's take a look at one more run. Our next time window is the 16th of July, but as you can see from the blue partition, we do not yet have the complete data for the 16th flushed to disk: segment_1_5 is still a consuming segment, and it contains some data for the 16th of July. Because of that, this time window is skipped for now; the next time the job runs, it will try to process the same time window again. This ensures we do not prematurely build segments for the offline table, and we only proceed with the next runs after we are able to successfully get all the events for the 16th, and then we keep going, one day at a time.

So far we've seen how this works conceptually. How do you enable it in your setup? You simply put a config in your real-time table telling it to enable the real-time to offline segments task. That's the bare-minimum config, but there is a whole bunch of other configs you can put inside this map if you want to control the behavior more closely. There is a bucket time period, one day by default, which controls the size of the time window each run operates on: in our example we moved one day at a time because the bucket was one day, but you can set it to anything you want, like one hour or two hours. Then there is the buffer time period, with a default of two days, which simply means a time window will not be processed unless it is older than the buffer time period.

Then there is the merge type, and this is where things start getting interesting. The merge type lets you set a function to merge and compact your rows and reduce the size of your data. There are three types: concat, dedup, and rollup. With concat there is no merging; it simply concatenates all the rows. With dedup, rows are deduplicated across the window being processed: if two rows have the exact same values across all the columns, they are kept as only one row. And finally with rollup, if the same values are seen across all the dimension columns, the rows are merged into one row and the metrics are summed up. Instead of summing the metrics, you can also change the aggregation type to max or min using the next config; in this example we're saying: roll up the rows, but when merging this metric, use the max aggregation function. Then there is the round bucket config, which lets you define how you want to change the granularity of your time column. Typically you have milliseconds in your real-time stream, but as the data gets older and is pushed to offline, you might be okay with rounding it to the nearest hour, and this is the config that helps you with that. And finally, there is a config that lets you control the maximum number of rows you want in your offline segments.
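Putting those knobs together, a hedged sketch of the task section of the real-time table config might look like the following; an empty map is the bare-minimum way to enable the task, and the metric name clicks and the values shown are purely illustrative:

    "task": {
      "taskTypeConfigsMap": {
        "RealtimeToOfflineSegmentsTask": {
          "bucketTimePeriod": "1d",
          "bufferTimePeriod": "2d",
          "mergeType": "rollup",
          "clicks.aggregationType": "max",
          "roundBucketTimePeriod": "1h",
          "maxNumRecordsPerSegment": "100000"
        }
      }
    }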
So far we've seen conceptually how this works and the config you would set to enable it. Now, who is orchestrating all of this? Who is making this elaborate set of steps happen across the cluster? The answer is the Pinot controller, with the help of the Helix task framework and a new component called the Pinot minion.

Let's look at how this happens, bringing back our cluster diagram from before. I've added the new components here: the Pinot minions and the task queue, which is part of the Helix task framework. Say you created your real-time table and added the config to enable this movement feature. The Pinot controller runs a periodic task in the background, and every time it runs, it looks for tables that have this config enabled. When it finds table_REALTIME with this config, it performs the first two steps we discussed earlier: calculate the time window it should operate on, and select the segments that fall in this time window. Using this, it creates a task spec containing the task name (real-time to offline), the table name, the selected segments, and the window start and end. The task spec is put onto the task queue, and a free minion picks up the task. The minion then downloads the segments, does the processing, creates the new segments, and uploads them to the offline table. Finally, the minion updates a special znode in ZooKeeper called the watermark, setting its value for this table to the end time of the window it just processed. That way, when the controller wakes up for the next run of the periodic task, it looks at this watermark and knows how to calculate the next time window.

We're almost at the end of the talk. We saw how everything works in the cluster and the config you would set to enable it. To wrap up: we were able to successfully convert our real-time table to a hybrid table using Pinot-managed real-time to offline flows, and with that we solved our problems of backfill and the ability to roll up and dedupe. Now, because we have the offline table, our operations are going to get a little easier, and we did not even have to write our own offline flows to achieve all this. Before I sign off, I know this talk was a little dense in content, so I'd like to leave you with some resources: there is documentation for this feature at docs.pinot.apache.org; please try this at home and let us know how it goes. We are here to answer any questions, so join our Slack channel (we recently hit a milestone in the number of members, and we'd love to see you there), and you can also connect with us on Twitter at ApachePinot. That's it from me. Thanks.

Karin: That's awesome, great presentation, thank you. It was a very dense topic, and I love the illustration you did; I think it might be my favorite one. Okay, I'll put the slides backstage. We do have a couple of questions that came in; if anybody has additional questions for Neha or Jackie, please ask now. Some of these might have been partly answered already, but I'll still read them out.
First question: do we miss out on the real-time aspect of the data while we move from real-time to offline ingestion?

Neha: No, we do not, because even though we are moving data from real-time to offline, we are still keeping the hybrid table. The very recent data stays in the real-time table; typically you would set, say, five days of retention for real-time and two years for offline. So you still get the real-time ingestion and the instant freshness you wanted with your real-time table; you're not losing any of that.

Karin: Mayank asks: if I have different indexing, partitioning, or sorting configs for the offline versus the real-time table, which config applies to the segments that are moved to offline? For example, I may choose not to use star-tree indexing for real-time but may want it for offline.

Neha: This is a good question, but it's a very implementation-specific detail. I thought it would apply exactly what is in the real-time config, but I can't say for sure, sorry.

Jackie: Sorry for jumping in. When you move segments from real-time to offline, the generated segment takes the offline config. So you can change your index type: for example, on the real-time side, Kafka with the low-level consumer forces you to partition, but on the offline side you might want to drop the partitioning. It's quite flexible; it basically takes the config from the offline table.

Neha: That's what I figured, that we would typically honor what is set in the offline config; it's just been a really long while since this feature was implemented.

Karin: Does the segment processor framework have a Python API, or only a Java API?

Neha: It is only Java.

Karin: There's a question here that I think was already answered, but I'll highlight it: "I guess we can't do dedup and rollup in one config, since there's no way to figure out which events to dedupe and which ones to roll up." The answer given was that the currently supported merge modes are mutually exclusive. Anything you want to add to that?

Jackie: Yes, Mayank's question is whether we can do dedup plus rollup at the same time, and the answer is no, because they are mutually exclusive.

Neha: But we should be able to support rollup and dedup together, right? I think that's valid.

Jackie: With rollup you can configure the aggregation; for example, if you use the max aggregation you can achieve the same behavior.

Karin: Xiaobing says thanks for the great talk, Neha, and nice doodle. Does this functionality assume the out-of-order data to be bounded?

Neha: Good question. That's why we have the config for the buffer time period. It's a best guess: if you think you may have out-of-order data coming in for the next two days, you can set the buffer time to two days, but if something comes in beyond that, we have lost it. So yes, it does assume the out-of-order data to be bounded, but you can tune that with the config as per your use case.
Karin: Xiaobing is also curious: what happens if some very old data flows into the real-time table, days after the run for that time window has finished?

Neha: Same problem; I think this is related to the previous question. If it goes beyond the buffer time, we're not going to move it. But this is what typically happens even if you were running your own offline ingestion flows: if your offline ingestion has already happened and you've moved on, and then after the ETL new events for that window appear in your real-time stream, you are on your own and will have to go and backfill it separately.

Karin: Kishore makes a statement: dedup happens on a specific key and rollup happens on a bunch of columns, so it is possible to do a dedup plus rollup, but not a rollup plus dedup; order matters.

Neha: Yes, this is similar to what we were discussing.

Karin: Cool. If there are no other questions, I think we can wrap up. Like I said, if anybody has additional questions, or if you happen not to be watching this live, you can join the Apache Pinot Slack channel and ping Neha and Jackie there directly, or just ask in the general or troubleshooting channels. We got some good comments: Rohit says great talk, Neha and Jackie, and nice t-shirt, Jackie. Thank you guys so much for taking the time.

Neha: Thank you for organizing this and having us. I'm a first-timer, but Jackie is definitely a veteran.

Karin: I'm actually pretty sure Jackie has been on StreamYard more times than I've hosted meetups, and I'm here pretty much every week; there was only one I missed last week. So I think he's more seasoned than I am, and we were the ones who set it up. All right, thank you everyone for joining. Jackie, Neha, thank you so much for taking the time to present, and we'll see you next week for the next meetup. Don't forget to like, subscribe, and comment. Next week's meetup is Apache Pinot and Superset: forecasting and visualization, so we're looking forward to that. Thank you all so much, see y'all next week. Bye!
Info
Channel: StarTree
Views: 253
Rating: 5 out of 5
Id: V_KNUUrS6DA
Length: 53min 12sec (3192 seconds)
Published: Tue Jul 27 2021