Geospatial Support in Apache Pinot

Captions
Hey everyone, thank you all so much for joining. I'm Karin Wolok, I oversee community at StarTree, and today we have a very interesting presentation on geospatial support in Apache Pinot. Our presenter is Yupeng Fu. He's a staff engineer at Uber, where he leads Uber's real-time platform and infrastructure, including multiple mission-critical services powered by several open source technologies like Kafka, Flink, and Pinot, so we're really excited to see his presentation. Really quickly, some lay-of-the-land stuff: if you have any questions during the presentation, feel free to ask. There are folks here from StarTree and Apache Pinot contributors and committers who might be able to answer them; if not, we will also do a Q&A immediately after Yupeng's presentation. And if you happen to be watching this on demand rather than live, you're welcome to join the Apache Pinot Slack. Yupeng is there, and there are other amazing people from the Pinot community who can probably answer questions too. So, without further ado, welcome, Yupeng.

Hello everyone, thanks for having me. Today I'm going to talk about the geospatial support that was added to Apache Pinot. Coincidentally, we just published a blog in the Uber engineering blog series; I will place a link in the channel, so if you're interested you can go there and take a look. I will cover some of that material in this talk as well.

To begin with, about myself: I'm a staff engineer at Uber, and I lead the real-time data infrastructure. This infrastructure includes many streaming and real-time systems like Kafka and Flink, and for real-time OLAP we use Pinot. I'm also a committer on several open source projects, including Pinot.

So let's begin with why we wanted geospatial support, and why we wanted to add it to Apache Pinot in the first place. Uber's business is highly real-time and geospatial in nature. Every day, if you open the Uber or Uber Eats app, there is a ton of information sent over the wire. We collect information from drivers, riders, restaurants, eaters, couriers, and merchants, and that is valuable information that we use for different purposes like fraud detection and recommendation, to derive important insights and provide value to the end users.

To give you one example that we also highlighted: here is one use case that shows why real time is useful and why geospatial data is useful. If you open your Uber Eats app and scroll down the home page, you will see a carousel called Orders Near You. This carousel displays what kinds of dishes people around you have ordered in the past minutes. The purpose is to provide some social engagement to Uber Eats users, to let people be aware of what's going on around them in real time. This is just one example; there are many other good examples showing how powerful geospatial information is, for example how many drivers or riders are in a given geolocation, which can be helpful for demand-supply balancing and fulfillment.

Back to the query behind Orders Near You. I will go into more detail later in the talk, but I want to quickly show you that the entire data backend for this feature can be described as one SQL query like the one below. You can think of it as one Kafka topic: when people place orders, the backend service sends the information to this orders topic, and we set up a Pinot table over the Kafka topic. Retrieving the nearby orders in the recent time window can then be as simple as this query. Notice the WHERE clause: there is a function called ST_Distance, a geospatial function that takes one location, the ST point stored with the order's information, and the current user location, which can be a parameterized value in the query. We check the distance to see whether the orders are nearby, and we can also put a time filter on top.
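For readers following along, here is a minimal sketch of what such a query could look like. The table name, column names, literal coordinates, and radius below are illustrative assumptions rather than the production schema from the talk; only the ST_Distance / point-construction pattern and the time filter come from the discussion above.

```sql
-- Hedged sketch of an "Orders Near You"-style Pinot query (names are made up).
SELECT restaurant_name, dish_name, created_at_millis
FROM eats_orders
WHERE ST_Distance(
        order_location_st_point,          -- geography point stored with each order
        STPOINT(-122.4194, 37.7749, 1)    -- requesting user's location (lon, lat, isGeography = 1)
      ) < 5000                            -- within roughly 5 km; geography distances are in meters
  AND created_at_millis > now() - 600000  -- only orders from the last 10 minutes
LIMIT 20
```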
This is very useful for the product teams, because the backend can be built as simply as running one SQL query, and we leverage Apache Pinot to take care of the system-level complexities such as reliability, availability, and scalability, which improves the productivity of the product teams building applications.

In this talk I will walk you through the geospatial features we added to Pinot and some details on how these functions are evaluated and how we made them fast and scalable. Before talking about the geospatial features, let me give a very quick overview of Apache Pinot at Uber.

For the real-time platform at Uber, this is a very popular chart: the intention is to perform analytics on the most recent data and help applications make time-critical decisions. We believe that for events, as time goes by, their value decreases. That is very straightforward, and there are many good examples of how we can extract value from events while they are still very recent. A few examples show the major categories of these real-time use cases. On the left-hand side you can see how Uber does surge pricing. It reflects a demand-supply imbalance in the Uber ride price, and the purpose is straightforward: to encourage more drivers to move to locations with higher demand, balancing the driver-rider marketplace through price. Another example is external analytics, which shows up as dashboards. The second screenshot is called Uber Eats Restaurant Manager. It's an application shown to restaurant owners so they can see the most recent performance of their restaurants, including sales, popular dishes, and feedback from users, all in real time. Real time is also very important for machine learning use cases. If you have used the Uber apps before, the third screenshot should be familiar: for Uber rides or Uber Eats we do delivery estimates, for example whether your order will arrive within half an hour; these are predictions made by machine learning models. In addition to that, we use real-time monitoring as reinforcement for the machine learning models, to check whether a model is predicting correctly and effectively. And lastly, for Uber internal users and data scientists, we also use real-time data for exploration, to derive insights not only from historical data but also from real-time data.

For the Uber real-time platform, the core is Apache Pinot.
Uber has talked about Apache Pinot for quite a few years, and based on our observation Pinot has seen very good adoption across the industry in the past years and has been well validated on its scaling, performance, and query latency. So far we are very happy with our choice.

To add more on this: you can see the scale numbers, which reflect the scale at which we run Pinot at Uber. We chose it primarily for its high QPS support, low query latency, and cost effectiveness. We did an analysis and study on the cost and storage efficiency side, because Uber is working really hard towards profitability, so cost efficiency is indeed a company priority, and we have seen that compared to many other row-based or document-based stores, for example Elasticsearch, Pinot showed great cost efficiency and cost reduction.

We actually published a paper at SIGMOD this year, a conference on databases and data management. This research paper documents how the Uber real-time data infrastructure is built and describes its major components, including these design decisions and how we chose different open source technologies as building blocks. Feel free to take a read if you are interested; the link is at the bottom of the slide.

As mentioned in the earlier slides, the platform powers a number of different external and internal platforms and use cases, such as dashboards. We also invested quite a lot in building the ecosystem around Apache Pinot at Uber. This shows how we improved user onboarding through UI-based workflow management, so that people can go there, click a few things, fill in some boxes, and onboard their Pinot table. On the query side we did the integration with Presto, which is a popular query federation engine originally from Facebook. We published some blogs in the past about how we made this decision and how Presto helped integrate Pinot with the rest of Uber's big data infrastructure.

So let's talk about the geospatial feature. We contributed this feature last year, in 2020, and we spent quite some time in the design phase figuring out how to add geospatial support in a future-proof way. To explain further, I would say geospatial is a special topic in big data management, and it's a pretty complex domain. In fact, there are SQL standards built around it, and there are many commercial databases trying to monetize this area; I think one of the reasons is exactly that geospatial is a complex topic. So in this part of the talk I will mainly mention a few challenges that we observed when we designed this feature.

One of the challenges is that geospatial has its own data model. In addition to the plain data model in a database, with primitives like strings, integers, and double values, geospatial has an abstract geometry type hierarchy. You can see this chart, which is copied from the OGC, the Open Geospatial Consortium, which is also a standard. It has a hierarchy and a subtype system, with concepts like points, curves, surfaces, polygons, and collections of these structures. Let me give a few examples of how the primitives are represented: how you define a point, how you define a linestring and a polygon. Each has a rich, standard definition.
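As a concrete illustration of those primitives, here is a hedged sketch using well-known text (WKT) literals, parsed with the constructor functions that come up later in the talk. The table name my_table and the coordinate values are made up for illustration; only the WKT syntax itself is standard.

```sql
-- The OGC primitives written as WKT and parsed into Pinot geometry values
-- (hypothetical table and coordinates).
SELECT
  ST_GeomFromText('POINT (30 10)')                                  AS a_point,
  ST_GeomFromText('LINESTRING (30 10, 10 30, 40 40)')               AS a_linestring,
  ST_GeomFromText('POLYGON ((30 10, 40 40, 20 40, 10 20, 30 10))')  AS a_polygon
FROM my_table
LIMIT 1
```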
Data modelling is definitely one challenge. Another challenge that follows from it is how to perform data definition and processing efficiently, and how to encode these values. To show one example: an Uber ride, as raw data, can be represented as lines using these geospatial formats. There are also multi-geometries, which are collections of geometry types. People may wonder why we need those; usually they are useful for capturing the result of some operation, for example an intersection or a difference. If we take a line and a polygon and compute their intersection, the result could be a collection.

Another interesting part is that when people talk about geometry, there are actually two popular coordinate systems: one is Cartesian based, the other is spherical. The Cartesian one is very straightforward; it uses planar coordinates. The spherical one is based on the Earth's shape and uses a different way of encoding data, namely the longitude and latitude we are familiar with. Often these two cannot be used interchangeably. As a quick example, say we want to find the flight path from Vancouver to Paris. On a planar coordinate system it is going to be a line, the shortest distance. However, if you take the Earth and look at the route the plane actually flies, it is not a straight line in planar coordinates. So there are additional things to take care of, and different algorithms are used to calculate distances or areas depending on which coordinate system is in use.

Because of the complexity of this geospatial hierarchy and type system, there are different formats trying to capture it; I just list a few popular standards here. In Apache Pinot we took the vector-based formats, which are the most popular and also widely adopted by many other big data systems.

The third part I want to talk about is geospatial indexing. This is mainly used to compute certain operations or functions on geo data efficiently, for example when you want to do a join or calculate a distance. One good example is the Orders Near You scenario from the opening slides: how can we make that data retrieval more efficient? These are interesting and challenging topics even in the literature. If you look at existing and past indexing techniques, there are different approaches to this problem, and it is quite well studied in the database community. To give you a sense of it, one popular approach in the past is the quadtree, which groups the data points using rectangles and builds a tree structure on top of them. At Uber we took a different approach. Even before we worked on the geospatial features in Pinot, Uber had already built a really good open source library called H3, a grid library that takes a hexagon-based approach.
To show what this means, on the rightmost figure you can see how we grid the geospatial space using hexagons. Each hexagon has six neighbors, and this approach has some unique properties compared to other grid systems, such as square-based or triangle-based ones. The important property is that all neighbors are at equal distance, which simplifies a lot of geo-based distance calculations. It is also a hierarchical grid system, which means you can have different resolutions, giving a trade-off between precision and computational latency and complexity. As an example of using this grid system, you can see how California can be approximated, and how the San Francisco area can be split using the grid. In later slides I will show how Pinot integrates with H3 to build the geospatial index and how we use it to speed up queries. One thing to keep in mind is that a hexagonal grid is different from a circle; it can only approximate one. That presents a unique challenge for computing exact results, and in the later slides I will describe how we overcame it.

All right, let's talk about how we built these geospatial features and addressed the challenges mentioned in the earlier slides. For the data types, we added full support for the geometry type hierarchy in Pinot. I just list a few examples here of how you can create these geo data structures, such as polygons and multi-lines; if you're interested, go to the Pinot documentation for more details. One stylistic difference is that we did not add these geometry types as native types in Pinot's type system; instead we reuse the BYTES type, which means all of these functions produce byte encodings. For serialization and deserialization we leverage an open source library called JTS, which is quite popular in the industry; systems in the Spark and Postgres ecosystems use it as well.

We provide functions to convert these data structures to and from different formats. There are mainly two formats: one is called well-known text (WKT) and the other well-known binary (WKB). These are exposed as corresponding conversion functions, also following the SQL/MM standard. In addition, we support both geometry and geography, the two coordinate systems I talked about, and the two can be interchanged using the corresponding conversion functions. The purpose is to support different kinds of geo use cases and geo functions, but keep in mind that some functions can only be applied to geometry or to geography; you can find more details in the Pinot user docs.

Now let's look at how to ingest geo data. It is common for raw data or events to carry some location information; the most common case is that you have a longitude and a latitude. All you need to do is leverage a transform function, another feature of Pinot data ingestion, to turn this into a geo data column. Specifically, you take the longitude and latitude columns, create an ST point from them, and, since we want the geography format here, call toSphericalGeography on it. All of these are built-in transform functions, and you can write them directly into your ingestion config.
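Below is a rough sketch of what that ingestion config could look like. The raw column names lng and lat and the derived column name location_st_point are assumptions, and the exact config layout should be checked against the Pinot docs for the version you run.

```json
{
  "ingestionConfig": {
    "transformConfigs": [
      {
        "columnName": "location_st_point",
        "transformFunction": "toSphericalGeography(STPOINT(lng, lat))"
      }
    ]
  }
}
```

The derived column then holds a serialized geography point (a BYTES value, per the JTS-based encoding described above) that the ST_ functions and the geo index can work with.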
OK, now we have the geospatial data available in Pinot; how can we query it? We support a number of geospatial functions, and we follow the ISO SQL/MM standard. This standard defines all the functions with the ST_ prefix, which stands for spatial and temporal. If you're interested, you can go to the referenced site to see the full spec of the SQL functions.

One thing I want to call out: when we approached this problem in Pinot, we wanted to conform to the standard as much as possible, mainly for ease of maintenance and the long-term sustainability of the feature. More importantly, inside Uber we have other big data analytical use cases, and we want users and data scientists who are familiar with the SQL standard to be able to use these functions on Pinot easily as well. So we did not rush the feature in with some customized DSL or syntax; we would rather go a bit slower and make sure we conform to the standard.

The geo functions can be grouped into different categories. There are constructors, for example for creating geometries from the WKT or WKB formats I mentioned earlier; there are measurement functions that calculate distances and areas; and there are output functions that serialize geo data into different formats. The link below points to the corresponding section of the user guide, where you can take a closer look at the signatures of all these functions and what they do. I also want to call out that the full spec contains a lot of geo functions; in Pinot we implemented a subset of the commonly used ones, and there are still many more to be added, so if you are interested in contributing to Pinot, that is a good place to help.

All right, let's take a look at the Orders Near You case again and see how we leverage everything we just talked about to run this query. It uses the distance function and the point creation function. A naive execution of the query is straightforward: take all the orders, filter by time, and then evaluate the function; that gets you the nearby orders. But because of the complexity of geospatial functions, a geo function is typically much more expensive to evaluate than other SQL functions like string or simple arithmetic functions. So if we execute this query naively, there is going to be a performance penalty. And looking at this use case, the query runs for every single Uber Eats home page load, so you can expect a lot of queries hitting the Pinot servers; it is therefore very much in our interest to make sure this query runs really fast so we can handle more concurrency. To improve query performance we invested a lot in geospatial indexing support, which I would say is something interesting and unique in Pinot.
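To make the function families above concrete, here are a few hedged one-off examples. The table my_table, the columns lng and lat, and the literal coordinates are made up; the functions themselves (ST_GeomFromText, ST_AsText, ST_Distance, ST_Contains, toSphericalGeography, STPOINT) are from the Pinot geospatial function set described above.

```sql
-- Constructor + output: parse WKT, then serialize it back to text.
SELECT ST_AsText(ST_GeomFromText('POINT (30 10)'))
FROM my_table LIMIT 1

-- Measurement on geography: great-circle distance in meters
-- (the Vancouver-to-Paris example from earlier in the talk).
SELECT ST_Distance(
         toSphericalGeography(ST_GeomFromText('POINT (-123.12 49.28)')),  -- Vancouver (lon lat)
         toSphericalGeography(ST_GeomFromText('POINT (2.35 48.86)')))     -- Paris (lon lat)
FROM my_table LIMIT 1

-- Relationship on geometry: count rows whose point falls inside a polygon
-- (ST_Contains returns 1 or 0 in Pinot).
SELECT COUNT(*)
FROM my_table
WHERE ST_Contains(
        ST_GeomFromText('POLYGON ((30 10, 40 40, 20 40, 10 20, 30 10))'),
        STPOINT(lng, lat)) = 1
```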
Let's take another look at this query. Assume we are at the center of San Francisco. With Uber's H3 grid system we can grid the San Francisco area as shown, and assume we want some distance threshold, which can be represented as a circle; we want to retrieve all the data points within that circle. A quick way to do this, instead of scanning all the orders across the globe, is to look only at the orders in the blue hexagons, which is exactly the data we get if we have the H3 index.

Let me go a little deeper into H3. H3 is a library such that, given a real geolocation, for example a longitude and latitude, it can construct an H3 index, which is essentially a hash of the location encoded as a long. We can leverage some of the geo functions from the H3 library to identify the relevant hexagons. An important one is a function called kRing: given my current location, assuming it is the center hexagon, it finds all the hexagons nearby. For kRing equal to one we get the six hexagons around the center; for kRing equal to two we get the next ring around that. This is how we can find all the indexes within distance k of the original index.

So how do we leverage this? The query algorithm is relatively straightforward once you understand how the H3 library works. Assume the center is in San Francisco. For all the hexagons that are fully contained in the circle, we know the data points inside definitely match, which means we do not need to evaluate the distance function for points within those hexagons; in this example, those are the hexagons within kRing distance one (sorry, I think the red circle is slightly misplaced, it should be moved a little higher, but you get the idea). For the points falling into the hexagons on the boundary, the second ring that overlaps with the circle, we do need to do some evaluation. Remember that in the earlier slides I said the hexagon-based approach can only approximate a circle, so to get exact results we take the orders in those boundary hexagons and run the ST_Distance function on the points within them. This is a great reduction of the data we need to run the function over: from all the data anywhere in the world down to just the outer ring around the circle, and we indeed observed great performance improvements using the geospatial index.

Creating the geo index is also pretty straightforward. On the left-hand side is the same data ingestion configuration from the earlier slide, and we create the index by declaring it on the geo column (sorry, I think there is a typo on this slide; it should be the geo location column). We set the index type to H3 and list the resolutions we need. A few quick comments on resolution: H3 has a number of resolutions, from 0 to 15, to control the granularity, and the resolutions at which we build the index determine how fast you can retrieve these data points, in exchange for the storage needed to build the index; you pay extra storage cost to speed up the query.
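Here is a hedged sketch of what declaring that index can look like in the table config, reusing the hypothetical location_st_point column from the ingestion sketch above. The exact property names and requirements, such as the column being stored without a dictionary, should be verified against the Pinot docs for your version.

```json
{
  "fieldConfigList": [
    {
      "name": "location_st_point",
      "encodingType": "RAW",
      "indexType": "H3",
      "properties": { "resolutions": "5" }
    }
  ],
  "tableIndexConfig": {
    "noDictionaryColumns": ["location_st_point"]
  }
}
```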
You can refer to the user guide to learn more about how to create the geo index.

I want to look further into the Orders Near You case, because it is a very interesting use case and shows the gains from using Pinot as the storage choice. When the production team first built the Orders Near You feature, they originally used Cassandra as the storage backend. The reason is that for Uber Eats orders, when people place an order, the order is stored in Cassandra, so a natural idea was to reuse that storage and some of the existing backend APIs. There is a service called the order gateway for order processing, handling not only writes but also reads. For each Orders Near You request, the feature retrieves orders from 20 nearby restaurants and then does some sorting to find the relevant orders, which means each incoming user request translates into 20 requests to the order gateway, and each order gateway request invokes six reads on Cassandra. Now you can do the math: each user request translates into 120 requests to Cassandra, which adds a lot of pressure on the Cassandra backend. The team estimated that they would have to increase the existing Cassandra capacity by 6x to support this query load.

So the team re-architected the entire backend solution, and this time they used Pinot. It is actually a pretty efficient solution: each of these requests translates into just one SQL query hitting Pinot. Not only did we save a lot of backend servers, we also observed very good latency improvements. Before, with so many remote RPCs and back-and-forth calls, it took seconds to fulfill a request; now with Pinot it takes less than 50 milliseconds, which is a great latency reduction.

Pinot also has a really good built-in feature to support horizontal scaling. If you're interested in this routing technique, there is a link to the Pinot user docs; the mechanism is called replica groups. In the diagram on the left, by default Pinot shards the data across different servers and each server holds one copy of its portion of the data; in total, each segment of the table has three copies by default. The query broker takes the query and does a scatter-gather: it dispatches subqueries to the corresponding servers, uniformly by default, to keep the query load uniform across servers. But this default approach has a few issues for high-QPS use cases. One is that the default number of replicas might not be sufficient when QPS is high, because at most three servers can serve queries for a particular segment. The other is that if any one of those servers is slow, the entire query becomes slow because that server is a bottleneck, and this results in long-tail latency issues. The technique for resolving this is replica groups. Each replica group has a full copy of the entire table, which means any given query can be dispatched to one replica group and fully served by it, and the replica groups are disjoint. The way to scale the backend is then very simple: increase the number of replica groups as you see more and more QPS coming in.
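For reference, the Pinot docs describe replica-group assignment and routing with table-config stanzas along the following lines. This is a rough sketch from a reading of those docs, not the configuration used in the talk; the tenant tag, the instance counts, and the exact key names are assumptions that should be double-checked against the Pinot version you run.

```json
{
  "instanceAssignmentConfigMap": {
    "CONSUMING": {
      "tagPoolConfig": { "tag": "DefaultTenant_REALTIME" },
      "replicaGroupPartitionConfig": {
        "replicaGroupBased": true,
        "numReplicaGroups": 3,
        "numInstancesPerReplicaGroup": 4
      }
    }
  },
  "routing": {
    "instanceSelectorType": "replicaGroup"
  }
}
```

With a layout like this, the broker picks one replica group per query instead of scattering across all servers, and adding QPS headroom becomes a matter of raising the number of replica groups, matching the scaling approach described above.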
This is how we achieved horizontal scaling.

An additional insight we gained from this use case is the separation of OLTP and OLAP. For people familiar with database technology or the big data ecosystem, this is a pretty common paradigm. On the transactional side, there are popular OLTP engines, for example MySQL or Cassandra, that take business-critical write operations; in this particular case Cassandra serves as the persistent store of the Uber Eats orders, and we do not want to put that storage system at risk. The online analytical queries, on the other hand, have their own query patterns and needs, and typically these read queries have much higher volume than the writes. By setting up this separation, using Kafka as the event log, we can ingest the changes into Pinot, which then takes all the analytical queries. Not only does this isolation provide higher reliability and fully protect the OLTP storage, we can also easily scale the Pinot servers up and down based on the analytical query load. It also makes the query latency much better, because the OLAP engine is designed for analytical queries: it is a columnar store with different kinds of index support, which is quite different from a traditional transactional store like MySQL or a key-value store that is not designed for analytical purposes.

Let me share a few useful links. These are the geospatial features we contributed last year: the first set of functions was released in 0.6, and the geo indexing was released in 0.7. You can find the user guide and the design doc on the Pinot website. Kenny from the community wrote a pretty good introduction blog on these features, and today, from the Uber engineering side, we also published a blog highlighting Orders Near You, some other insights we gained from this use case, and the geospatial features. I also want to take the opportunity to thank Jackie and Kishore from the Pinot community, who helped a lot on both the design and the engineering; without their help we couldn't have made these geospatial features available in Pinot so soon. I will stop here and am happy to take questions.

Thanks, Yupeng, that was awesome. Thank you, and thank you for all your work on this too, and to everyone else you shouted out. There is a slight delay between the video and the Q&A, so if anybody has any questions, please ask now, directly in the chat. We do have a couple of questions that already came in. Mayank asked: curious how the Cassandra cluster size compared against the Pinot cluster size, in number of nodes?

Yeah, I cannot disclose the exact size of the Cassandra cluster, but I would say it is definitely much bigger than the Pinot cluster, and if we multiply that by six, it is well over an order of magnitude bigger than the Pinot cluster size.

Nice. Mayank, I hope that answered the question. Mayank is making a joke; he said, "I know what you're thinking, asking a question instead of answering for a change," because he's usually in here answering questions. But he answers the questions so well during the event that by the time we do the Q&A there are no questions left.
So Matt asks: how do you protect rider location from staying in the database for an extended period? Are there any GDPR implications in your use case?

Yeah, that's a good question. Uber is very serious about GDPR, and the entire data lake at Uber conforms to all the GDPR requirements. For the real-time data in this particular case, the order data ingested into Pinot is for analytical purposes, and we shield all the sensitive information from it. I would say this is another great benefit of the separation between the OLTP and OLAP setups: for the analytical queries we do not want to hit the transactional store directly, since it holds more information than necessary for the analytical purpose. When we do the data ingestion through Kafka, the engineers can be conscious about what information is needed and pick only the necessary fields for this particular case. I hope this answers your question.

Yeah, it kind of sounded like two questions, or maybe one; because for protecting the rider location from staying in the database for an extended period, you basically addressed that too, right?

Yes. Also, this is real-time information, and in Pinot we can configure the data retention, that is, how long we want to keep the data. For this particular use case we only need a couple of hours of data to serve this type of query load.

Got it, cool. If anybody else has any other questions, please ask now. And if you happen to be watching afterwards, some people join a little late and start watching the video from an earlier point, so they don't make it to the Q&A right away, but you can always ask in the Slack; you can ping Yupeng directly or just ask in the general channel.

Yeah, feel free to reach out to me; I'm also available on the Pinot Slack channel. And I would say there is also interesting work remaining on the geospatial part, so if you have interest in this area, I will also follow up on the GitHub issues.

That's awesome, that's really cool, and definitely a big part if you're interested in being involved. Cool, so I don't know if there are any other questions coming in; sometimes they flow in late because of the delay, but I think we might be good. We are getting some thank-yous: "it was a very awesome presentation," and "thanks, Yupeng, awesome work." I definitely second that. Mayank said thank you. The feedback is really helpful, so for all the attendees, if you have any specific feedback, please share it. Rohit says "great session, Yupeng, thank you," and we're seeing a lot of thumbs up. Make sure all you attendees out there are subscribing, commenting, and liking; definitely give this video a thumbs up if you enjoyed Yupeng's presentation. Thank you.

Cool. Yupeng, thank you so much for taking the time to break all this down; very, very informative. That's awesome; can't wait to see where this goes. We got an applause emoji as well. All right, thank you so much for being with us, and we'll see you in Slack. Thank you; have a great day, folks. Oh, and also, did you want to mention the blog too?
Oh yeah, I sent the blog link to the Slack channel. I'm not sure whether people can see it on my screen; I can add it to the stream as well.

Oh yeah, if you can post it, we can post it on the Slack too. We also have another question that just came in.

Oh, sure. See, sometimes the delay is weird; the questions come in super late, just as we were about to close it off.

Sure. That's right, regarding the question about GROUP BY: yes, the geospatial functions are regular Pinot functions, so you can group by function results. Just keep in mind that some of the geospatial functions are complex to evaluate; that's why we added the geospatial index support. Especially when you have polygons: my favorite example is a city whose boundary is drawn with thousands or even hundreds of thousands of points, and you want to check whether a given point is contained in the city; that can take seconds to evaluate. So commonly there are techniques to generate a city ID or country ID associated with the events for quick filtering or evaluation, and people often do this as part of preprocessing as well. But this work is about more general-purpose geospatial support, so that you can also use it for different kinds of geospatial query loads. I hope that answers the question.

Cool, yeah, I think that was pretty thorough and makes sense. I also included the URL for the blog in the chat; I dropped it in there.

Yeah, cool.

Yupeng, thanks again; this was really awesome and very informative. Thank you, have a good one. Bye.
Info
Channel: StarTree
Id: _9VB21clg8E
Length: 55min 32sec (3332 seconds)
Published: Tue Jul 20 2021