Building Real-Time Analytics Applications Using Apache Pinot

Video Statistics and Information

Captions
Is there anyone who has not used LinkedIn? Okay, no one is trying to escape from all these companies who are tracking you. If you look at this page, you can see all the things that are available on LinkedIn: you have a member, you have jobs, ads, posts, companies, courses. Most of you are familiar with all of these, but there are also these tiny little numbers that you probably haven't noticed — all the analytics that you can see on LinkedIn. So today what we are going to talk about is how we generate all these analytics out of all the things that you do on LinkedIn.

Let's look at what someone is actually doing on LinkedIn most of the time. The entire LinkedIn activity is modeled in an actor-verb-object model. You can think of a user, a recruiter, or a company representative as an actor, and a verb is what you do on LinkedIn: you might comment on something, you might like an article, and you can perform any of these actions on one of these objects. This is the fundamental unit of how activity is tracked at LinkedIn. In fact, this is so important that, if you followed the recent news, LinkedIn even added a few more verbs — celebrate, love, insightful, curious — so all these new things have come up. This is really the bread and butter, and a lot of things are built on top of it. Today's talk is mainly about how LinkedIn takes all this data coming in from various users and converts it into products that help LinkedIn grow.

The ultimate goal of any company is to get into this everlasting cycle: you build products, the products generate data, you use the data to generate insights, and last but not least you use those insights to create more products. You are always in this cycle, generating more engagement and better products. This is exactly what we did at LinkedIn. If you have used any of the charts on LinkedIn — who viewed my profile, or, when you publish an article, how many shares it got and where they came from, all that slicing and dicing — most likely something behind the scenes is Pinot. Pinot powers roughly 50-plus site-facing applications at LinkedIn, and more or less 30 of them are real-time: if someone looks at your profile, you immediately get to know — the count goes up — and then you get slicing and dicing on the data. There is also Talent Intelligence, which lets you look at how competitors are doing, what the distribution of your workforce is, and so on. You have millions of records and terabytes of data, all powered by Pinot. There are also a lot of internal products powered by Pinot: the experimentation platform, anomaly detection, and lots of other internal apps.

A quick word about me, for those of you who just got in: I have worked on a lot of Apache projects, and these are some of the projects I have created over time at LinkedIn. Apache Helix is the cluster management framework used to build various distributed systems.
Pinot is the OLAP datastore that I'm talking about today. Espresso is the document store at LinkedIn, which stores pretty much all the profiles and is a source of truth. Third Eye is another project I recently started working on, which helps you do anomaly detection and root-cause analysis on top of time-series data. I left LinkedIn at the end of last year, in December, and I wrote this article — I have to confess it was the first time I actually published an article on LinkedIn in seven years; it took seven years for me to publish one. It was a really wonderful journey for me at LinkedIn, I loved every bit of it, and I really had to write this just to express my gratitude. You get all these analytics once you publish an article on LinkedIn: how many people viewed it — it's very funny that more people from Google viewed it than from LinkedIn, but that's what the data says, not what I think — where they came from, their geo locations, and things like that.

So the first thing I'll take up is how we actually came up with this — what happens behind the scenes to power such a thing. In case your company has this activity stream, activity data coming from your users, what can you do with it, and how do you build a product that takes that data and builds more products?

The first use case I'm going to talk about is providing analytics on article views. Let's start with a simple premise: you have a user who is looking at an article — they might view it, like it, share it, or do any of the actions I mentioned earlier. The first thing you have is a stream that just says someone liked something; that's pretty much how every activity or event is represented at LinkedIn. The traditional approach to solving this is: you take this data and put it into a database. You might have something like a member ID, an action, and an article ID — what kind of action did you perform on this article — plus the time. The other thing you need, on the other side, is the member's attributes, because you want to allow slicing and dicing by industry, where they came from, what skills they have, which company they work at. That's not available in every event, because it's too much to put in every event, so you typically have a member profile — at LinkedIn this is stored in Espresso — which has all the attributes of the member who viewed your article.

The solution is very simple, and this is kind of the old-school way — even today, if the data is not really large, you can get by with it: you have an app and you do the join on the fly. The query is slightly complicated, but you can definitely figure it out: you join the two tables on the member ID, pull the attributes of the member, group by industry, and you get one of the widgets I showed on the previous screen. This works very well. There are a couple of problems with it. You can get real time depending on the kind of storage you use here: with some storage systems you should be able to keep appending and it will work; in other cases you'll have to do a batch upload into these databases.
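In relational terms, the widget described here boils down to something like the following minimal sketch. The table and column names (article_activity, member_profile, industry) are hypothetical, and run_query is a stand-in for whatever database client is in use; the point is only that the member attributes are joined in at query time.

```python
# Hypothetical sketch of the "join on the fly" approach (option 1).
# Table/column names are illustrative, not LinkedIn's actual schema;
# run_query() stands in for whatever SQL client the serving database provides.

ARTICLE_VIEWS_BY_INDUSTRY = """
    SELECT p.industry, COUNT(*) AS views
    FROM article_activity a
    JOIN member_profile p ON a.member_id = p.member_id
    WHERE a.article_id = %(article_id)s
      AND a.action = 'VIEW'
    GROUP BY p.industry
    ORDER BY views DESC
"""

def article_views_by_industry(run_query, article_id):
    """Group an article's views by the viewer's industry, joining at query time."""
    return run_query(ARTICLE_VIEWS_BY_INDUSTRY, {"article_id": article_id})
```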
The other really big concern is that the latency is very high, because the join happens at query time. If an article is very popular — and there are a lot of popular articles out there — it's doing a lot of lookup joins across the two tables and has to scan through a lot of rows to answer what the number of views is, or to slice and dice on that.

The second option is a little more complicated, but I'll walk you through it. The first part is exactly the same: you get the data and put it into the activity stream. But instead of putting it directly into the DB, you join the data upfront. There is a stream processing framework in the middle that does the lookup as the event comes in, not at query time, so you have shifted the work from query time to ingestion time — that's the key difference to take away here. You look up the member table and ingest a new event that contains everything you need: the member profile and all the attributes of the member, along with the article ID and the action. Now the app is quite simple — it doesn't involve any join; it can query this single table directly and aggregate. The good thing about this is that it's near-real-time ingestion. I've also marked it as low latency, but with an asterisk: it's unpredictable. The reason is that the cost, the time, is now proportional to how many rows you are scanning to compute your answer — I'll get back to how Pinot addresses this. If the article is not very popular, the time taken to show the analytics won't be much, but the popular ones, the ones with a lot of comments and likes, are going to take a lot more time to answer.

The third option takes it to the next level, where you want really good latency: you pre-cube, which means you more or less compute the answers up front. The only subtle difference from the previous option is that the previous one just joined and stored the data as-is — think one record for every activity — whereas here one record for every activity is fanned out into multiple records, because you are pre-computing, or pre-cubing, for all the different dimensions: you might have one row for the geo, one row for the industry, and so on, so you have a lot of fan-out. The problem is that you are adding a lot more storage. It is very fast, because most of your lookups become a single-row lookup, so the latency is very good — Kylin, for example, is one of the systems that does this pre-cubing, and it is very fast — but it's very hard to get real time this way, because you typically end up batching the data, since you have to aggregate it and the premise is that for every unique combination there is just one row. And of course the storage is going to increase exponentially as you add more dimensions.
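To make the contrast concrete, here is a hedged sketch of what the reads look like in options 2 and 3. The table and column names (enriched_article_activity, article_activity_cube, view_count) are illustrative only.

```python
# Option 2: the stream job already denormalized the member attributes into each
# event, so the query is a single-table aggregation -- no join, but the engine
# still scans every matching row, so latency grows with the article's popularity.
PRE_JOINED_QUERY = """
    SELECT industry, COUNT(*) AS views
    FROM enriched_article_activity
    WHERE article_id = %(article_id)s AND action = 'VIEW'
    GROUP BY industry
"""

# Option 3: the cube already holds one pre-aggregated row per
# (article_id, action, industry) combination, so the read is close to a
# key-value lookup -- very fast, but the cube must be rebuilt to add a dimension.
PRE_CUBED_QUERY = """
    SELECT industry, view_count
    FROM article_activity_cube
    WHERE article_id = %(article_id)s AND action = 'VIEW'
"""
```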
Any new dimension — say tomorrow I want to slice by the browser the user was on, or some other dimension — means you pretty much have to re-bootstrap everything. The query capability is also limited, because you cannot do arbitrary slicing and dicing or complex boolean expressions; you can only do a simple AND.

So typically, if you want to build something like this, you have all these options to pick from, and depending on the scale and how far you think the product is going, you pick one or the other. Let me summarize what you gain and lose with each of these solutions. The leftmost part, where you started, is the traditional solution where you join on the fly. The flexibility there is huge — it's amazing, you can run pretty much any SQL query — but the problem is latency: the more complex the query gets, the higher the latency, so you won't really be able to use it for a site-facing use case. As you move along this axis, you get the ability to pre-join and pre-aggregate: the latency gets much better, but you are slowly losing flexibility, because you cannot really do a join any more — you're just doing single-table queries. The systems on the left side — Presto, BigQuery, Redshift — are all amazing in terms of giving you full flexibility; they store the raw table and you can query it, but you can't really expect millisecond-level latency or a strong p99, which is what's required here. In the middle you have something like Pinot, Druid, Elasticsearch, InfluxDB — all of these are possible solutions you could pick today. And if you go all the way to the right, that's where you pre-cube everything and store it. If you noticed, I put Pinot in two places; as we get later into the session we'll see why. But this gives the crux of what you gain and what you lose as you pick any of these solutions.

At LinkedIn, once we put all these things together, this is how the architecture looks. For the streaming solution we use Kafka — LinkedIn is the place where it originated. For stream processing we use Samza; that's where we look up Espresso, our database storing all the member profiles, create the new joined activity event, and put it back into Kafka. Pinot consumes it, and you get article analytics on top of that — the app simply queries Pinot. So what you saw at the beginning is basically powered by Pinot. We have two years of retention, but that's configurable — whatever the product decides — and sometimes we might choose to keep the data for longer but drop some of the dimensions and do some rollups.
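The ingest-time join that the Samza job performs can be sketched roughly as below. This is a minimal Python sketch of the idea, not the actual Samza/Espresso code; consume_activity_events, lookup_member_profile, and publish are hypothetical helpers standing in for the Kafka consumer, the Espresso lookup, and the Kafka producer.

```python
# Minimal sketch of the ingest-time enrichment step (the stream job's role).
def enrich_activity_stream(consume_activity_events, lookup_member_profile, publish):
    for event in consume_activity_events():                   # {"member_id", "action", "article_id", "time"}
        profile = lookup_member_profile(event["member_id"])   # {"industry", "geo", "company", ...}
        enriched = {**event, **profile}                        # denormalize member attributes into the event
        publish("joined_article_activity", enriched)           # downstream, Pinot ingests this topic
```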
That was the first part. Looking back at the lifecycle, you had the data and you generated a product; now, can we do better? How can we use this data in a more interesting way than just showing it to the user — can we use it ourselves? One of the cool things you can do with this data is re-rank the feed. The feed is always a two-sided thing: as a member I come and look at it, and the articles are the other side. You have a member who comes, you have all their connections, and you have all the articles those connections have written or shared — which ones do I show you? That's where the feed matching, or ranking, algorithm comes in. The simple option is to just take all the feed items, dump them, and show them to you; or we can re-rank them. This is where the second part comes in: can we use all these attributes of the user and the article to re-rank the feed? You can look at things like the counts — how many views, how many clicks — and the age category of the article, and re-rank your feed based on that. One of the cool things here is that we saw a huge increase in our engagement rate because of the re-ranking: it's no longer static, it looks at how many times you have already viewed an article, and if you have already viewed it, it's automatically hidden and you are shown a new set of articles — so every time you come to the LinkedIn home page you might see more interesting articles.

The architecture is pretty much exactly the same. The only key difference, if you look at it, is the parts shown in green: the article data and the article table. Earlier it was only looking at the member table to do the join; now we also get additional features from the article itself — which category it belongs to, what the age of the article is — and when you push the event to Kafka, it now has additional attributes about the article. The query looks something like this. As I said, one of the key things that is super important for us in solving this analytics use case is latency. We have close to 6,000 queries per second coming in, which is not really heard of for an analytics use case — think of analytics, but serving an OLTP kind of workload. The 50th percentile is as low as 5 milliseconds, and we have a very strict SLA of 100 milliseconds that we cannot miss, because the feed is really one of the first things you do as you log in to LinkedIn.
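The actual query was on a slide and isn't captured in the captions, so purely as a hedged guess at its shape: a re-ranking query of this kind might aggregate per-article engagement features and filter out what the viewer has already seen. Table, column, and helper names here are hypothetical.

```python
# Hedged sketch of a feed-candidate scoring query; not the slide's actual query.
FEED_FEATURES_QUERY = """
    SELECT article_id,
           SUM(CASE WHEN action = 'VIEW'  THEN 1 ELSE 0 END) AS views,
           SUM(CASE WHEN action = 'CLICK' THEN 1 ELSE 0 END) AS clicks,
           SUM(CASE WHEN action = 'VIEW' AND member_id = %(viewer_id)s
                    THEN 1 ELSE 0 END) AS viewer_already_viewed
    FROM enriched_article_activity
    WHERE article_id IN %(candidate_article_ids)s
    GROUP BY article_id
"""

def rank_feed(run_query, viewer_id, candidate_article_ids):
    """Fetch per-article engagement features, drop already-viewed articles, re-rank."""
    rows = run_query(FEED_FEATURES_QUERY,
                     {"viewer_id": viewer_id,
                      "candidate_article_ids": tuple(candidate_article_ids)})
    fresh = [r for r in rows if r["viewer_already_viewed"] == 0]
    return sorted(fresh, key=lambda r: (r["clicks"], r["views"]), reverse=True)
```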
Most of you might have heard of Pinot alongside Druid — Druid is very popular in this particular space — so I often get the question of why we didn't use Druid and why we built Pinot. The basic premise of why Pinot started at LinkedIn was to serve the site-facing use cases, and even today Druid is not really built for site-facing use cases — it's very good for internal analytics, and I'll come to that in the next section — but Pinot has lots of optimizations geared towards handling site-facing, high-throughput, low-latency traffic. Some of the key things built in that we attribute to this performance difference are the sorted index and the per-query optimizer: we optimize the query based on segment metadata, so for every segment — every small set of data — the query plan can look very different. This is very different from traditional SQL databases, where only the logical plan is optimized; in Pinot we optimize even the physical plan. We also have the concept of optional indexing: unlike Druid, which needed indexing for every field, Pinot doesn't really need it, because it knows how to optimize depending on the metadata of the segment.

So that's the first part: how we take this data, generate more products, and show them to the users. The second part is what internal users can do with this data — what business insights can we get from the same data? We have a tool internal to LinkedIn called Raptor, which generates dashboards and insights for pretty much every business metric over time. We have close to 10,000 business metrics that are generated on a daily basis, and it's the same data. You can look at where the clicks and the views are coming from by country — the dimensionality is much richer than what we have on the site-facing side. You can also look at the grouping by industry: is there any change in the trend, is it coming from the software industry or from real estate, what is the distribution and how does it change over time? This is used extensively inside LinkedIn. Similarly, where did the views come from, who is the referrer — there are lots of different dimensions, and I won't get into the details, but the key thing is that we now have the ability to do this for all the various metrics at LinkedIn. These are the business metrics that internal people look at, analyze, and draw insights from; they even look at whether there is a problem with a metric and what can be improved. There is lots and lots of data here, and you have the ability to slice and dice, roll up, and drill down.

The architecture here is slightly different, but more or less the same. The only key difference is that you don't really need real time for this — we do have it for some use cases, but it's kind of overkill to have business insights in real time for internal folks across all the metrics. For the key business metrics we do have real-time support that looks pretty much like the architecture I showed earlier, but for the offline use cases we have two sources of truth: Kafka and Espresso. The data comes into HDFS via a tool called Gobblin — that's the ingestion framework at LinkedIn — and everything is dumped into HDFS. Then we have a pretty cool framework called UMP, which centralizes all the metric definitions and the computation. The dashboards I showed earlier are simply expressed in a language — we use either Spark SQL or Pig — where you define the logic and the metric, which dimensions you need, how often it should be computed, and what the retention is, and UMP simply computes all of this on a daily or regular basis and pushes it to Pinot, and then you have the UI on top of it. The previous screens I showed were all coming from data computed by UMP and stored in Pinot, with Raptor as the UI where you see the dashboards.
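The kind of query such a dashboard issues against the offline, day-rolled-up table might look like the following sketch. The table and column names (page_view_metrics_daily, days_since_epoch, click_count) are assumptions for illustration, not the actual Raptor schema.

```python
# Hypothetical dashboard queries over a daily-rollup table in Pinot.
CLICKS_BY_COUNTRY_TREND = """
    SELECT days_since_epoch, country, SUM(click_count) AS clicks
    FROM page_view_metrics_daily
    WHERE days_since_epoch BETWEEN %(start_day)s AND %(end_day)s
    GROUP BY days_since_epoch, country
    ORDER BY days_since_epoch
"""

VIEWS_BY_INDUSTRY = """
    SELECT industry, SUM(view_count) AS views
    FROM page_view_metrics_daily
    WHERE days_since_epoch BETWEEN %(start_day)s AND %(end_day)s
    GROUP BY industry
    ORDER BY views DESC
"""
```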
Here is another comparison between Pinot and Druid, in this particular case: we took these queries and ran them across Pinot and Druid. As you can see, the p50 and the p90 are not too different, and the difference is not really noticeable by a human — roughly 40 milliseconds versus 136 milliseconds is not a big deal for a dashboarding application. But if you look at the total time it took to run the same set of queries, Pinot took almost half of what Druid took. There is an interesting thing you can see at a meta level: on the same cluster you can actually serve a lot more queries with Pinot than with Druid, even though for a single query's response time there may not be much of a perceptible difference for a user.

All right, on to the next part — how are we doing on time? The next part is: okay, now we have all these dashboards; do we want people to keep coming and looking at these dashboards every day, when they're just looking for a trend or for some anomalies? What if we do that for them automatically? This is where the project I was working on, called Third Eye, comes in. We built an anomaly detection framework on top of the same data that powers the dashboards, so we can automatically start monitoring all of it. Typically, what people do for anomaly detection is build a streaming solution: you stream the data, compute the models, and compute anomalies on top of the data. Third Eye has a very different architecture: we stream the data directly into Pinot, the data is available there, and the anomaly detection is just a stateless system that simply queries Pinot regularly, looks at the time-series data, and computes the anomalies.

This provides two things: anomaly detection, and what we call root-cause analysis. If there is a problem, there are two ways to do root-cause analysis on it. One is to have a hypothesis based on prior experience — you might say the issue was in the US last time, or it was in this browser — and then try slicing and dicing; at that point you are basically guessing where the problem is. This is a completely different way of looking at the problem: we go over the entire data, look at all the dimensions at once, and in one shot we can say where the problem is coming from. The heat map guides you towards the things that have issues. In this case there was an increase of about 5% week over week on this article, and we wanted to know where it was coming from — with so many different dimensions, how do we look into that? Here it's pretty obvious: you just look at the big dark blue cell. The nice thing about this UI is that you can click each cell, deep dive, and go into the next dimensions as well, so it's very interactive, with sub-second latency.
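A drill-down like this boils down to issuing one aggregation query per dimension and comparing the current window against a baseline (for example, the same week a week earlier). A minimal sketch, with hypothetical table, column, and helper names:

```python
# Sketch of the heat-map drill-down: one group-by query per dimension,
# run for both the current and the baseline time window.
DIMENSIONS = ["country", "browser", "platform", "industry", "referrer"]

DIMENSION_BREAKDOWN = """
    SELECT {dimension}, SUM(view_count) AS views
    FROM page_view_metrics_daily
    WHERE days_since_epoch BETWEEN %(start_day)s AND %(end_day)s
    GROUP BY {dimension}
"""

def week_over_week_breakdown(run_query, current, baseline):
    """For each dimension, return {value: (baseline_views, current_views)}.

    `current` and `baseline` are parameter dicts like
    {"start_day": 18000, "end_day": 18006}; run_query is assumed to return
    rows as dicts keyed by column name.
    """
    breakdown = {}
    for dim in DIMENSIONS:
        sql = DIMENSION_BREAKDOWN.format(dimension=dim)
        now = {r[dim]: r["views"] for r in run_query(sql, current)}
        then = {r[dim]: r["views"] for r in run_query(sql, baseline)}
        breakdown[dim] = {v: (then.get(v, 0), now.get(v, 0))
                          for v in set(now) | set(then)}
    return breakdown
```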
Even though it's covering all the different dimensions, it's not a single query that renders something like this — you need multiple queries. For every dimension you can think of a different query, and you have to do a group-by across different time ranges, so there are lots of queries running; but with the capability we have in Pinot, we are able to solve even this.

So what's the challenge behind the scenes? It's very easy to do anomaly detection at the top level, because you are just looking at the aggregate sum of the metric, but multi-dimensional anomaly detection is much more challenging. If you unroll it, this is how the loop looks: you want to check whether there is a problem with any combination of dimension values, because sometimes everything looks good at the top and you may not see an issue — say, a problem with Android — that doesn't surface at the top level. So you want to try all these combinations, and that is definitely challenging, because you now need to look at all the different combinations and there is no single key column that lets you look at a subset of the data — you pretty much have to scan the entire data to compute the aggregation.

Let me give you a quick example of why this is hard. Say your query is trying to detect an anomaly in the US. Assuming most of the clicks and views are coming from the US, even though you have an inverted index, the inverted index only helps you find all the rows for the US; the second part, the aggregation, involves scanning all those rows, and that's going to be slow. The same query for Ireland will actually be pretty fast, because there are not many views coming from Ireland.

That leads to the part I said I'd get back to later: the space-time trade-off. If you look at all these solutions — pre-cube, pre-join, pre-aggregate — it's really a trade-off between two things: space and latency. The OLAP or columnar stores like Druid — and Pinot in a form similar to Druid, that is, without the star-tree index I'm about to talk about — sit here: the latency can be variable depending on your input. In the previous example, the US query can be very slow and the Ireland query fast, so you have a lot of variability. At the other end you have the pre-cubed solutions, where the latency is very good because a query more or less becomes a lookup, but the storage is very high because you're pre-computing a lot of things, and as the number of dimensions increases the storage cost increases drastically — you have a steep curve there.

What Pinot added is the concept of a star-tree index, which lets us sit somewhere in between — we can go to either extent. Without getting into the details, it's a smart, data-driven pre-computation. In the previous example with the US, it looks at the data and figures out automatically: if you ask a query for the US you'd have to scan a lot of rows, so I'm going to pre-compute only for the US, but not for Ireland.
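In open-source Pinot this knob surfaces in the table configuration as a star-tree index definition. A hedged sketch follows (field names as I recall them from the Pinot documentation — verify there; the dimensions and column are made up), shown as a Python dict for illustration. The leaf-record limit corresponds to the per-query scan budget described just below.

```python
# Hedged sketch of a star-tree index definition in a Pinot table config.
star_tree_table_config_fragment = {
    "tableIndexConfig": {
        "starTreeIndexConfigs": [
            {
                # Dimensions the tree splits on, in order.
                "dimensionsSplitOrder": ["country", "browser", "industry"],
                # Aggregations to pre-compute at the tree nodes.
                "functionColumnPairs": ["SUM__view_count"],
                # The trade-off knob: roughly, no query served from this tree
                # should have to scan more than this many records.
                "maxLeafRecords": 10000,
            }
        ]
    }
}
```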
The explosion in pre-cubing typically comes from cubing every combination without knowing whether it's going to be expensive or not. So Pinot becomes a hybrid solution: you can go all the way to one extent, which is super fast, or behave very similar to Druid and just keep the data in raw format, or you can use this knob which says, for any given query, what is the maximum number of rows I'm willing to scan. The configuration is really based on that: you can say that for any given query of the form "a equals some value and b equals some value", I should never scan more than 10,000 records — 10,000 is your knob there. That lets you trade off how fast you want the queries to be against how much you're willing to spend on storage; if you don't want to spend extra storage, you just let it be infinity. So the left side of that chart is the configuration set to infinity, and the right side is 1, which means for any query don't scan more than one row — that's the key-value end.

This shows the impact: we are now able to run the anomaly detection and the dashboarding on the same data set. With just an inverted index we could not really scale beyond 10 queries per second on this data, because once you have anomaly detection the queries are machine-generated and the load is much higher; but with the star tree we are able to get a lot more out of the same set of machines. That's the key difference — you may not see a difference on a per-query basis, but a lot of these small differences add up, and over a large set of queries you see a pretty significant change. The architecture is very similar to Raptor's; the only thing that gets added is Third Eye on top. The nice part is that the same data is used for anomaly detection and for the dashboard visualization, so you don't have to operate all these different systems, and there is no data inconsistency where you detect an anomaly in one system but when you go look at the data in another it's something completely different — all those operational challenges are removed.

On usage outside LinkedIn: Uber is a big user of Pinot. They have a lot of site-facing applications — if any of you has used Uber Eats, the restaurant manager actually gets a dashboard to look at analytics on all the orders, and that's powered by Pinot — and a lot of internal things at Uber, like the marketplace, Uber Pool, Uber Freight, and JUMP, are all powered by Pinot on the back end. The other one is Slack — Slack is one of the users as well. They were using Druid earlier, and if you look at the message from them, they basically moved from a 45-node cluster down to 18 nodes and still have a lot of capacity — that's what I keep referring to, that you can get a lot more out of the same set of boxes with Pinot than with Druid. It's funny that both Slack and Microsoft Teams use Pinot.
On the conclusion side, the key takeaway I'd like you to think about is this: most of you have this activity data, and there are lots of varying kinds of applications you want to build on top of it — a site-facing application, internal analytics, or anomaly detection — and these applications can start very small but become very popular later. What most teams end up doing is building different solutions for each of these use cases: a key-value store for one, an OLAP store for another, stream processing for another. That results in a lot of operational challenges and having to build and maintain all these different systems. Our goal at LinkedIn was basically to build one system that lets us handle this whole variety of use cases, and that's what we were able to achieve. That's pretty much it. I have a few minutes for questions, and if you are interested in using Pinot at your company, feel free to chat with me — I'll be able to help you. The documentation for Pinot is honestly not great; that's something we are working on. We have a Slack channel which is active, and we are also in the Apache Incubator, so we will definitely be able to provide help in using it. Thank you.

Q: Great talk. You spoke a lot about latency in terms of how long the query takes to run, but there's another level of latency, I'd imagine, from creating these pre-computed views. If you were querying the raw data directly, you wouldn't have to go through those steps of aggregating the data. Are there any considerations for how far behind the data is from when it first streamed in, by virtue of making all those roll-ups?

A: It's not really a problem, because Pinot provides real-time ingestion, and the stream processing engines are also real-time, so it's a matter of a few seconds. For example, for the profile views I mentioned: you look at someone's profile and within about five seconds it's generally there, and most of that lag comes from the Kafka hops across data centers. Inside Pinot itself, as soon as the data is ingested it's available for querying. I do get the second part of your question, which is: we are doing these joins — is it needed? That's where you want to ask the product: what do you gain from making it really fast for the user by pre-computing? LinkedIn is a classic example where we built a lot of these applications — 50-plus — and the product managers know how it drives engagement and what value it provides to customers. So it's really a trade-off. Do you want to do it for everything? Obviously not — as I mentioned, we don't have real time for all the internal metrics; we have almost 10,000 metrics and only a few of them are real-time. It's a conscious decision about whether you get value out of it — we have the capability, but maybe not all the use cases should use it. Thank you.

Q: What are the data sizes of your offline analytics Pinot DB?

A: We have more than a hundred terabytes of data for the internal ones, because after some time we roll up to a day level. The real-time portion is, as I mentioned, much smaller; for the offline ones we keep the daily data for almost two to three years, depending on the retention.

Q: Is this 100 TB after indexing or before indexing?
A: It's after indexing. Before indexing it's probably close to a petabyte or so — there's a lot of compression you get by indexing and storing the data in Pinot's dictionary-encoded format.

Q: If I may ask one more question: this 100 terabytes — are you storing it offline somewhere, on HDFS?

A: No, Pinot has its own local storage, so it's stored on the Pinot servers themselves. We keep a copy offline just as a backup, but Pinot pulls the data in and memory-maps it locally.

Q: How does it fare if you have a use case with a lot of different metrics, each with their own dimensions — what I'll call events: a number of different types of events that have their own metrics and dimensions, and we have to do aggregations like hourly, ten-minute, or daily aggregation. Is Pinot built for that kind of use case?

A: Yes, definitely. Whether you do it in the same table or in different tables is up to you. One option, if you want it in the same table, is to create multiple time columns: one rolled to the nearest ten minutes, another to the nearest hour or nearest day. Then, using the star tree, it will automatically create aggregates for the days, for the ten-minute buckets, or for the weeks. When you ask a query for a particular ten-minute slice it goes to the corresponding leaf level; otherwise it looks at the higher level, so you get much better performance. That's where the per-query optimization is very important: it doesn't have the same plan for every query, and a lot of the performance improvement you're seeing comes from that ability to optimize on a per-query basis.

Q: One more quick one and then we'll wrap up. Pinot requires Helix as a cluster manager, right? Are there any plans to support other cluster managers, like Kubernetes?

A: Those are two different things. I worked on Helix, so I can give you an explanation of how Kubernetes and Helix differ — they're actually complementary to each other. The way to look at it is that all these containers have a lifecycle: one part of the lifecycle is just starting and stopping, and that's where Kubernetes comes in — provisioning, starting, and stopping the container. But there are a lot of other things that happen once they start — partitioning, replication, and so on — which Kubernetes doesn't really understand, and that's where Helix comes in. So Helix and Kubernetes are complementary.

Q: Okay, so we can use Kubernetes for Pinot, but it's independent — we don't have to get rid of Helix to use Kubernetes?

A: Right, they're independent of each other.
Info
Channel: Data Council
Views: 7,253
Rating: 4.9333334 out of 5
Keywords: linkedin data science, linkedin data infrastructure, apache pinot tutorial, apache pinot, real time analytics with apache pinot, real time analytics using apache pinot, Kishore Gopalakrishna, Kishore Gopalakrishna linkedin
Id: mOzjVRf0yt4
Length: 42min 5sec (2525 seconds)
Published: Sun Jun 23 2019