Data Science with Rust - Arrow, DataFusion, and Ballista by Andy Grove

Captions
And we're up for the archive video. All right, welcome to the Denver Rust Meetup, regardless of wherever you are in the world; these days it doesn't matter, since we couldn't meet in person even if we wanted to. So welcome, everybody. This is the second meetup we've had since we restarted after COVID hit. We have a speaker for next month, and after that we'll probably take a break for Christmas, and then, if anybody wants to talk, it sounds like Ahmed might be able to give one. If anybody wants to give a talk, please let me know and I'd be happy to schedule it for next year. I don't want to lock in any specific time, like every Thursday at 6:30; I figure we can do this whenever, so that people in plenty of different time zones can come and join us. Also, I'm eventually going to lose my Zoom Pro account and will have to purchase one, so I'd love to find a sponsor to help with that. If any of you know a company that would be happy to do that, I'd be happy to do whatever advertising they'd want in exchange, so reach out to me.

You know that person? Talk to me. Oh yeah, I'd be happy to; let's talk about that offline. I sent you a private message about the talk anyway, so we can also talk about that at the same time. Sounds great, take care.

All right, with that, I don't want to take up any more time. Andy, are you ready? I'm ready to go, yeah; let me share my screen here. I guess I should have tested that earlier, but okay, I think I'm showing my full screen; hopefully everybody can see a slide. I can see it. Awesome.

So this talk is going to be about Rust and data science, and specifically about three projects that I've been pretty involved in: Apache Arrow, DataFusion, and Ballista. All of these projects are related, as we'll see as we go through this. Feel free to interrupt and ask questions as we go.

Just a little bit of background about me: I've been in software forever, 30 years, and somehow I've ended up in the JVM world for most of the last 20 years, and I think that's one thing that led me to start getting really excited about Rust. I worked with C++ before getting into Java, and I felt the need to get back into some systems-level programming. Over the past decade I've mostly been working with distributed systems and query engines, so I figured a great way to really learn Rust beyond the basics would be to try and build something. So I took on an overly ambitious project: I tried to build something like Apache Spark in Rust. There's a blog post about it which I can share in the chat later on. I started this project about two years ago, actually nearly three years ago now, I guess, and that's what led to these various projects.

So let's start with Apache Arrow. Arrow is a cross-language development platform for in-memory analytics, and Arrow is really two things: it's a specification, and there are a bunch of libraries in different languages that implement the specification. The core of the specification is what's normally referred to just as the Arrow format, and this is basically a memory layout for columnar data.
It covers primitive types, fixed-width types like Int32 and Int64; it covers variable-length types like strings or binary; and it also handles nested types: lists, maps, structs, and so on. All of the memory layouts have some commonality: the raw values are stored in contiguous buffers. For variable-length types there are separate buffers containing offsets, so if you want to find the start of the string at, say, index five within the array, you look up index five in the offset buffer, and that gives you the actual point where the value starts in the values buffer.

I'm so sorry to interrupt; I think I might not be the only one seeing just your editor view and not the slides as you're moving forward. Thanks. Okay, let me try that again. Here we go. Multi-monitor. Yes, that's it. All right, thanks for that. Yeah, I think I shared the browser rather than the screen, so you may not have been seeing this slide. It was really just talking through this first section about the format, explaining that all of these arrays are backed by buffers containing data and offsets.

Then there's a separate validity bitmap, so the representation of null values isn't stored in the buffers themselves but in a separate bitmap, and that makes it very efficient to vectorize operations. Let's say you have two arrays of integers and some of the values may be null. Naively, if you're coding this yourself with regular arrays, you'd have a for loop and say, well, if array one is not null and array two is not null, then add them together. Those kinds of branches in the loop make it hard for the compiler to vectorize the operation, so it's much more efficient to just add all of the numbers together, even though some may not be valid, and then apply the bitmap afterwards to get the final values. This whole thing is really optimized to take advantage of vectorized processing, and for most people that means either using SIMD on the CPU (same instruction, multiple data), and it's really well optimized for GPU as well. So that's the core memory format.
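To make that vectorization point concrete, here is a minimal sketch in plain Rust, with no Arrow dependency and illustrative names, of the branchy approach versus the add-everything-then-apply-the-bitmap approach. Real Arrow uses bit-packed validity bitmaps rather than a Vec of bools; this only sketches the idea.

```rust
// A minimal sketch (no Arrow dependency) of the idea described above:
// instead of branching on nulls inside the loop, add everything and
// apply the validity bitmap afterwards. Names here are illustrative,
// not Arrow APIs.

/// Branchy version: hard for the compiler to auto-vectorize.
fn add_branchy(a: &[i32], b: &[i32], valid_a: &[bool], valid_b: &[bool]) -> Vec<Option<i32>> {
    (0..a.len())
        .map(|i| {
            if valid_a[i] && valid_b[i] {
                Some(a[i] + b[i])
            } else {
                None
            }
        })
        .collect()
}

/// Arrow-style version: add all values unconditionally (easy to vectorize),
/// then combine the validity bitmaps in a separate pass.
fn add_then_mask(a: &[i32], b: &[i32], valid_a: &[bool], valid_b: &[bool]) -> (Vec<i32>, Vec<bool>) {
    let sums: Vec<i32> = a.iter().zip(b).map(|(x, y)| x.wrapping_add(*y)).collect();
    let validity: Vec<bool> = valid_a.iter().zip(valid_b).map(|(x, y)| *x && *y).collect();
    (sums, validity) // garbage values where validity is false are simply ignored
}

fn main() {
    let a = vec![1, 2, 3];
    let b = vec![10, 20, 30];
    let va = vec![true, false, true];
    let vb = vec![true, true, true];
    println!("{:?}", add_branchy(&a, &b, &va, &vb));
    println!("{:?}", add_then_mask(&a, &b, &va, &vb));
}
```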
Again, this is designed so that you can do this in any language and pass the data between languages without the overhead of serialization, because the memory format is the serialization format. Now, when you're passing data between different languages you do also have to deal with metadata, so Arrow defines this IPC format, inter-process communication, and that's really about describing the metadata: the schema, the fields, the data types. That uses FlatBuffers, which is a serialization format that's well supported across languages.

And then finally, and this is a quite recent development, Apache Arrow now has the Flight protocol, which is gRPC based, so it's specified in protocol buffer format. This is a protocol for distributed systems dealing with Arrow data, so it has the concept of sending queries and fetching results, or sending data, uploading or downloading, and it's designed so that you can do this in parallel across servers. Traditional database systems with JDBC and ODBC drivers tend to have a single connection to one server, and even with distributed systems this is a common pattern: even though you have the power of a cluster to process your queries, you're often down to a single channel to fetch the data back, and everything's being funneled through a single server. The Flight protocol is designed to break out of that pattern and have more of a distributed, parallel interaction style from a client. So that's the specification.

Then there are the libraries themselves. I think there are eleven libraries now. C++ and Java are the most mature, and the Python and R libraries leverage the C++ library, so those are pretty mature as well. Rust, obviously, I'm going to be talking about tonight, and there's a PR that just went up this week, I think, for a Julia implementation. All of these libraries have the representation of these memory formats for the different array types, and some of the implementations go further and provide compute kernels: you have these arrays of data and you typically want to perform some operations on them, whether it's arithmetic, filtering, or performing aggregates. C++, Java, and Rust are at least the three libraries that I know of which provide those kinds of kernels. And finally, the next level up is query engines. Rust is the first Arrow implementation to have a query engine, but there is one being developed in the C++ library right now, which is pretty cool, and the great thing about the C++ one is that it will be exposed to Python and R, so people using Python and R get exactly the same semantics for the underlying computations.

Okay, so moving on to the Rust implementation. These are really the core concepts in the Rust Arrow crates; within the Apache Arrow project there are multiple Rust crates, and the core arrow crate is what we'll talk about now. There are representations of the metadata (schemas, fields, data types), and then there are arrays for the different data types. The primitive array type is for fixed-width types like integers and booleans, and then, although I missed some out here, there are arrays for variable-length types like string and binary, and for things like structs and dictionary arrays, so it's getting fairly comprehensive at this point. Typically you use a builder to create an array in the first place and populate it with data, although there are convenience methods provided for converting from things like Vec, or Vec of Option of some type; there'll be some code examples coming up. The Rust implementation also has a bunch of compute kernels, so you can actually do useful things with the data you have. There's a thing called a RecordBatch: if you're representing tabular data, which is very common, you can represent batches of columnar data with a known schema. And finally we have some I/O: for CSV, JSON, and Parquet, readers are provided where you can read directly into Arrow arrays rather than having to read into one format and then convert into Arrow, and in some cases writers are available. CSV has read and write, I think JSON is read-only, and Parquet is read-only in the 2.0 release; there's a lot of work ongoing to get the writer in place, so hopefully that will be available for the next release.
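As a rough illustration of those convenience methods, here is a minimal sketch using the Rust arrow crate. It assumes roughly the 2.0-era API described in the talk; exact method names and signatures can differ between releases.

```rust
// A minimal sketch, assuming the Rust `arrow` crate (roughly the 2.0-era API);
// exact signatures vary between releases.
use arrow::array::{Array, Int32Array};

fn main() {
    // The `From<Vec<Option<i32>>>` convenience: null values become unset
    // slots in the validity bitmap rather than sentinel values.
    let array = Int32Array::from(vec![Some(1), Some(2), None, Some(4), None]);

    assert_eq!(array.len(), 5);
    for i in 0..array.len() {
        if array.is_valid(i) {
            println!("index {} = {}", i, array.value(i));
        } else {
            println!("index {} is null", i);
        }
    }
}
```

The builder API mentioned in the talk does the same thing incrementally, appending values and nulls before calling finish to produce the immutable array.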
Okay, so on to some code samples. This is a very simple example of building an array of a primitive type; in this case we're building an array of Int32s, so we create a builder with a fixed capacity, say there are going to be five elements, and then we can start appending values or appending nulls. The important thing here is that the values are going into a fixed-width, contiguous piece of memory, and when we append null values we're just skipping over an element but writing to the bitmap to say that those elements are null, so whatever values are in there just get ignored. Then when we call the finish method it gives us an immutable array from those values.

And here's an example of using one of the compute kernels; I copied this from one of the unit tests. In this case we're creating two arrays using the from method, just converting vectors to arrays as a convenience; it's not the most performant thing to do. We have an array of Int32s and a boolean array, and we're demonstrating that we can call this filter kernel and pass in references to those two arrays, so it's using the boolean values as predicates to filter the Int32s. In that boolean vector we have two true values, so after running this the result is an Int32 array with two values. You'll also see here the use of downcasting: there's one trait for Array and obviously there are many types of arrays, so this is a very common pattern in Arrow. You typically rely on the metadata, the schema information about the data set you're working with, so you can introspect the schema to know that you're dealing with Int32, for example, and then you can downcast the array to a specific type like Int32 and call methods like value or is_valid to check, for each element, what the value is and whether it's null or not.

I mentioned record batches and schemas. A RecordBatch is a very simple structure: it's basically just a vec of arrays, but it has a schema reference as well, and the schema is essentially a vector of fields, where each field has a name, a data type, and whether it can have null values or not. It also has some metadata, so if you're dealing with things like dictionary-encoded arrays there'll be some extra metadata around that as well.
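Here is a hedged sketch of what those snippets might look like with the Rust arrow crate (roughly the 2.0-era API; module paths and signatures vary between releases, and the literal values simply mirror the description above):

```rust
// A hedged sketch of the filter kernel, downcasting, and RecordBatch usage
// described above, assuming a 2.0-era `arrow` crate.
use std::sync::Arc;

use arrow::array::{Array, BooleanArray, Int32Array};
use arrow::compute::kernels::filter::filter;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;

fn main() -> arrow::error::Result<()> {
    // Build the input arrays from vectors (convenient, not the fastest path).
    let values = Int32Array::from(vec![1, 2, 3, 4]);
    let predicate = BooleanArray::from(vec![true, false, true, false]);

    // The filter kernel keeps only the rows where the predicate is true.
    let filtered = filter(&values, &predicate)?;

    // `filter` returns a type-erased array, so downcast using the schema
    // knowledge that this column is Int32.
    let filtered = filtered
        .as_any()
        .downcast_ref::<Int32Array>()
        .expect("expected an Int32 array");
    assert_eq!(filtered.len(), 2);
    assert_eq!(filtered.value(0), 1);
    assert_eq!(filtered.value(1), 3);

    // A RecordBatch is just a schema plus a vector of equal-length arrays.
    let schema = Arc::new(Schema::new(vec![Field::new("id", DataType::Int32, false)]));
    let batch = RecordBatch::try_new(schema, vec![Arc::new(values)])?;
    assert_eq!(batch.num_rows(), 4);
    Ok(())
}
```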
Okay, so that's a quick overview of the arrow crate. Does anybody have any questions so far? I can't see the chat, so speak up if you have any questions. Okay, sounds like we're good to move on.

So I mentioned there are multiple crates in the Arrow project's Rust sub-project. The arrow crate provides these data structures, and the next crate we're going to talk about is DataFusion. DataFusion builds on top of the core arrow crate to provide a query engine supporting SQL and DataFrame APIs, and it leverages the Arrow compute kernels. You can run queries currently against CSV, Parquet, and in-memory data, and there are APIs where you can plug in your own data sources as well. In the recent release of Arrow this week, DataFusion now uses async, which is a pretty huge step forward, and we're using the Tokio threaded runtime to actually do the query execution. It supports partitions: say you're working with a Parquet data source, where typically you have a directory containing multiple files, DataFusion will process those files as separate partitions, in parallel, using Tokio.

This is an overview of the architecture of query execution. The DataFrame is really the main API for building a logical query plan, and a logical query plan is really just a description of what you want to do rather than how it's going to be executed. It's a tree-like structure containing operators: at the bottom of a query plan you typically have an operator like a Parquet scan, where it's reading the Parquet files, and you have other operators like filter, projection, or aggregate. The DataFrame API is typically how you build your query plan. If you're using SQL, the SQL code is also using the DataFrame API to build the query, so it doesn't matter whether you're using SQL or DataFrames, you get the same query plan. And once you build a DataFrame, even if you're building one manually, you can then run SQL against that DataFrame as well, so you can mix and match those two approaches.

DataFusion has an optimizer. It only has some fairly basic optimization rules so far, and you can plug in your own. It has things like predicate pushdown, where if you have a WHERE clause in your query it will try and push it down as far into the query plan as it can, so it filters out rows as early as possible to get the best performance. It will also do things like add implicit casts between data types where that's supported. After the query optimizer has run, we end up with an optimized logical plan, and the next step is to translate that into the physical plan, which is where we're really dealing with things like partitions, use of the Tokio thread pool, sort order, those kinds of things. For example, say we're doing an aggregate query, a SELECT with a GROUP BY, a SUM, and a COUNT. The logical plan is very simple for that; there's a single operator for the aggregate. But once we get into physical planning, what we really want to do is parallelize as much as possible, so the way a hash aggregate would be executed is that the aggregate runs against each partition in parallel, then the results of those aggregates get combined down to a single partition, and a secondary aggregate gets applied to produce the final result. The physical plan is pluggable as well; in fact, some people are starting to use DataFusion now with their own back ends, so they can take advantage of everything from SQL through to the optimized logical plan and then provide their own physical planner if they want to do something completely different with query execution, like running a distributed query, for example. DataFusion itself is purely in-memory, single process.

So far the DataFrame supports these operations, so it's not as comprehensive as we need for a lot of real-world things. The thing that's really missing right now, the next thing, is joins, and there is work starting on implementing joins for a future release, but it can do a good job of aggregates and filters and so on.
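The two-phase hash aggregate described above can be sketched in plain Rust. This is just an illustration of the idea with made-up data, not DataFusion's implementation: each partition computes partial sums and counts per group in parallel, and a final step merges the partials.

```rust
// A plain-Rust sketch of the two-phase aggregation idea: per-partition
// partial (sum, count) values computed in parallel, then merged.
use std::collections::HashMap;
use std::thread;

fn partial_aggregate(partition: &[(String, i64)]) -> HashMap<String, (i64, u64)> {
    let mut acc: HashMap<String, (i64, u64)> = HashMap::new();
    for (key, value) in partition {
        let entry = acc.entry(key.clone()).or_insert((0, 0));
        entry.0 += value; // partial SUM
        entry.1 += 1;     // partial COUNT
    }
    acc
}

fn main() {
    // Two "partitions" of (group key, value) rows.
    let partitions = vec![
        vec![("a".to_string(), 1), ("b".to_string(), 2), ("a".to_string(), 3)],
        vec![("b".to_string(), 4), ("a".to_string(), 5)],
    ];

    // Phase 1: aggregate each partition on its own thread.
    let handles: Vec<_> = partitions
        .into_iter()
        .map(|p| thread::spawn(move || partial_aggregate(&p)))
        .collect();

    // Phase 2: merge the partial results into the final aggregate.
    let mut merged: HashMap<String, (i64, u64)> = HashMap::new();
    for handle in handles {
        for (key, (sum, count)) in handle.join().unwrap() {
            let entry = merged.entry(key).or_insert((0, 0));
            entry.0 += sum;
            entry.1 += count;
        }
    }
    println!("{:?}", merged); // e.g. {"a": (9, 3), "b": (6, 2)}
}
```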
Okay, so here's a code example. It's a pretty simple one, but it's using the DataFrame API. We start by creating an execution context, and then we can start calling methods to build up our query plan. The first thing we do here is call a read_parquet method, passing a path to a Parquet file or directory. Each of these methods returns a DataFrame; at any point we could go and execute it, but as we call these methods we're really just building up the representation of the query we eventually want to run. After calling read_parquet we call select_columns, which is a simple projection based on column names, so again that returns a modified DataFrame, and then we call filter, so we filter rows based on the id column: we only keep rows where id is greater than the literal value one. Again, nothing has actually happened at this point; we're just describing what we want to happen. Then here we're executing, and we want to collect all the results back into memory, basically, with the collect method, which is an async method. This is where it goes through the whole process of translating the logical plan into the optimized plan and the physical plan and then actually takes care of the execution.

Here's another example, the SQL API. When you have SQL you're running queries against tables, so we need to register those tables. In this case, on the context we call a register_parquet method, and there's one for CSV as well, and we provide a table name and the path to where the data lives. With that done, we can go ahead and run regular SQL, and again, the sql method returns a DataFrame, so we just call collect if we want to run the query and get the results. There are other methods too, so rather than collect you could, say, save to a CSV file, and in the future you'll be able to save to Parquet and other formats.
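Putting the two walkthroughs together, here is a rough sketch of what they might look like with DataFusion's Rust API from around that release. The crate paths, method signatures, file path, and table name here are assumptions and vary between versions; treat it as a sketch rather than a definitive listing.

```rust
// A hedged sketch of the DataFrame and SQL examples described above,
// assuming a DataFusion API roughly contemporary with the talk.
use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    // DataFrame API: build a lazy query plan, then execute it with collect().
    let mut ctx = ExecutionContext::new();
    let df = ctx
        .read_parquet("/path/to/data.parquet")?
        .select_columns(&["id", "name"])?
        .filter(col("id").gt(lit(1)))?;
    let results = df.collect().await?; // Vec<RecordBatch>
    println!("dataframe query returned {} batches", results.len());

    // SQL API: register the file as a table, then run SQL against it.
    ctx.register_parquet("example", "/path/to/data.parquet")?;
    let df = ctx.sql("SELECT id, COUNT(*) FROM example WHERE id > 1 GROUP BY id")?;
    let results = df.collect().await?;
    println!("sql query returned {} batches", results.len());
    Ok(())
}
```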
So, on to performance. This is an unfair benchmark, and I'll explain why in the next slide. What I did here is take TPC-H. TPC, the Transaction Processing Council, produces a lot of industry-standard benchmarks that supposedly represent real-world usage of databases and analytics platforms, so I chose TPC-H. A lot of the queries in this benchmark suite require joins; query one is one of the queries that doesn't require joins, which is why I chose it, and it's an aggregate query with quite a number of aggregates. One of the nice things about the TPC benchmarks is that they provide a data generator and you can choose to generate data at different scales, so it's a great way of testing scalability. In this case it's a small data set, just 100 gigabytes. I tested running this 100-gigabyte aggregate query with DataFusion and with Apache Spark, the current GA release, with a varying number of threads, and what this shows is the query time: Spark is a bit faster on a small number of threads, and DataFusion is about 30 to 40 percent faster with a higher number of threads, which is cool, but I think my takeaway is that they're roughly equivalent. That's a good sign, because in previous releases DataFusion wasn't really competitive with other platforms out there, as you would expect for a fairly new project, but it's reached a level of maturity now where it's a much closer comparison.

As I say, this is an unfair benchmark, and it's unfair in both directions. Apache Spark is a distributed system designed to scale on clusters of hundreds or thousands of servers; DataFusion isn't, it's just using a thread pool, so it has a simpler design and doesn't have the same overheads. Also, Apache Spark is really optimized for big data, and 100 gigabytes isn't really big data. Spark goes through some effort up front before it even starts running the query: it does code generation to take advantage of the JIT compiler in Java, and on a small data set that overhead can have some impact. Spark would probably do better on much larger data sets, but conversely, DataFusion scales down much better. I typically see that Spark needs five to ten times more memory than DataFusion, which is pretty significant, so for some cases, like if you're dealing with smaller data sets in the below-terabyte range, it may well be advantageous to use something like DataFusion.

Okay, so that's the end of the DataFusion part. Are there any questions on that part?

You mentioned you're doing a lot of work in ETL, is that right? Yeah, that's really been the main use case I've been following, and that's what's fueled my interest in this: SQL and DataFrame-like operations, doing those kinds of transformations.

Spark, being a distributed system, I know is used a lot for ETL, typically in the cloud where you're paying a high cost for the memory you're using. Is that probably the wrong architecture for companies doing ETL, and could something like this be super beneficial for them, by starting a new process every time a new file comes in, if it's small data? Yes, it really depends how big your data is. If your data fits on one computer you shouldn't be using Spark, in my opinion; the real benefit of Spark is when it's just not feasible to do it on one computer, and you can do a lot on one computer. In my day job I'm using computers with hundreds of cores, very powerful systems, a terabyte of RAM, so you can do a lot. One of the things that's promising about DataFusion and Ballista, which we'll come to, is that because it's five to ten times less memory intensive you can do so much more on a single node. Once you go to multiple nodes you have inherent overheads, regardless of how good your technology is: all the overhead of the data being sent over the network and shuffled around. So the more you can do on a single node, the better. Today with Spark maybe you need five nodes because you can't fit in memory, but with DataFusion, if you can get it on a single node, it's going to go a lot faster than Spark on five nodes; at least that's the theory.

And is a lot of the memory overhead in that comparison just due to garbage collection, or are there other things going on? I don't have a definitive answer to that. Garbage collection is definitely a big factor. I know Spark pretty well at this point, but I'm not a total expert in it. It's designed to do everything in memory, as is DataFusion, and in Java there's a lot of object creation going on, so my gut feel is that it's mostly the object creation and garbage collection that have the biggest impact there. All right, thank you.

Another question: is this wonderful Arrow project ready for production? Arrow is already used in production generally, but maybe not specifically the Rust implementation of Arrow; to my knowledge nobody's using the Rust implementation of Arrow in production so far.
I do feel it is production ready, at least the core arrow crate; maybe DataFusion not quite yet, but it very much depends on your use case.

Would you also say it depends on the specific language implementation? I've noticed on the C# side there's very much that's marked as not implemented yet. Very much, and that's one of the things that's kind of interesting: it's one project, one repo, and the release, like the 2.0 release, covers all of the sub-projects, but all of the projects are at different stages of evolution, so it very much does depend on the language as well. One thing that's significant with Arrow: the project is about five years old, but the format, the specification, got to a 1.0 release a few months ago, so we now have a commitment that any changes to the specification will be backwards compatible, which is obviously a big deal. It's also kind of confusing, because we've just had the Arrow 2.0 release but it's still using the 1.0 version of the format, the specification.

Okay, so now we move on to Ballista, which is the last part of this. One thing to be clear on: Apache Arrow and DataFusion are both part of the Apache Arrow project; Ballista is just my project. And this is really back to where I started: me continuing with my overly ambitious goal of trying to build something like Spark in Rust. As I go through this journey I try things out in Ballista, and the things I find that work out well mostly get contributed back to DataFusion or Arrow, so this is my use case for driving the roadmap of Arrow, if you like, and obviously other people are doing the same thing with their projects, which is great.

So what is Ballista? I'm calling it a research project because it's mostly been me hacking away at it on weekends and evenings, and the goal is to build something like Apache Spark based on Apache Arrow. In some ways it's quite similar to Spark. I like the way Spark does its scheduling: Spark takes the query plan and breaks it into stages based on the partitioning. Within one query stage you may have multiple operators with the same partitioning, maybe a projection and a filter and a sort, and if the partitioning is the same it means you can execute all those partitions in parallel, so for one stage you break it down into tasks, or partitions, and send those out to the cluster to be computed, and when that stage is finished you can schedule the next stage that depends on its output. Ballista uses the exact same model for that.

But it's different from Spark in some ways. Spark is an amazing platform, but Scala, and the JVM in general, I feel wasn't the best choice, for many reasons but especially the garbage collection: when you're dealing with terabytes or petabytes of data, garbage collection can become really problematic, and that's really why I thought Rust was a great choice. I made some other decisions as well. I wanted it to be very language neutral.
One of the issues with Spark is that everybody has to learn Scala to use Spark; if you want to write UDFs you pretty much have to write them in Scala. You can do things like Python, but there are some overheads involved there, which there wouldn't have been if the whole thing had been built on Arrow from day one, but it's very hard to retrofit standards like these. So with Ballista I made some choices. I chose the Arrow Flight protocol for all of the interaction between the executors in the cluster, because that's well supported by different languages. And for the query plans themselves, I defined a protocol buffer format for describing physical query plans, and logical query plans, so that those can be exchanged between languages as well; there's no reason for those to be tied to Rust or Java. That means you can build executors in different languages, and within the context of one query being executed it could involve custom code: maybe you've got some legacy Java stuff you need to call, maybe you've got some cool new Rust stuff, and there should be no reason why you can't mix and match those in the same query plan.

Here's a diagram to show the architecture. Ballista is designed to be run as a distributed cluster with executors. I targeted Kubernetes as the main deployment platform, and for local testing I use Docker Compose as well; everything's containerized, so an executor is a container implementing the Flight protocol, and from that point you don't really care what the implementation is. Within the Ballista project (I actually missed one on the slide) there are three executors today: a Rust executor, a JVM executor that's actually implemented in Kotlin, and a Spark executor. The Spark executor can take a Ballista query plan, translate it to an Apache Spark query plan, and execute it, so it already has compatibility with Spark, and conversely it would be possible to call Ballista from existing Spark jobs. That was one thing I was really keen on: it'll be a long time before Ballista is as comprehensive as something like Apache Spark, which has thousands of contributors and has been around for many years, so I thought a great way to make it easier for people to try would be if they could take one piece of their pipeline, experiment with moving that to Rust and running it in Ballista, and have it work with their existing Spark cluster.

The current state of Ballista: there was a 0.3.0 release a few months ago, and it's based on the previous version of DataFusion and Arrow. As I said, it supports Kubernetes. From a client you can use the DataFrame API or a SQL API to build your query, and then you submit that to the cluster for execution, and it has roughly the same support as DataFusion in terms of the different operators and expressions. The performance isn't fantastic; I've got it to a point where it's okay, but now that there have been a lot of optimizations in Apache Arrow 2.0, one of the next steps is to upgrade Ballista to use this new Arrow release, and I'm hoping we'll see a good boost in performance. There's also more work to do on the scheduler in Ballista; it's currently pretty naive, and that's really one of the main areas I'm going to be looking to improve. With that, I have a quick demo, which I conveniently have open somewhere.

Okay, so this
is just a command-line-based demo, and I don't know if many of you are familiar with Kubernetes, so let me just hit play. What we're doing here is running some commands to make sure there are no pods running in our Kubernetes cluster, and then we're applying a YAML description which basically says to run a number of executors based on some Ballista Docker images. Here we're spinning up, I think it was twelve, actually I guess it's only six, executors, and these are the Rust executors implementing the Flight protocol. With that running, let me just pause this for a moment: we're running a client, an example which is running the TPC-H query from SQL, and this debug output is showing the query plan that the SQL is being translated into. At the bottom here we have a Parquet scan with the path to where the data is; this assumes there's a shared file system, so all the executors have access to the same data. After the Parquet scan, the next operator is a selection, or a filter, where we're filtering rows based on a date, and then there's the aggregate query itself: we're grouping the data on a couple of fields, and then there's a whole bunch of aggregates with some math expressions in there. So we hit play again, and this is now running in parallel across these executors; then the grouped data gets combined on one of the executors, the final aggregate happens, and the results get returned to the client. So this is a similar experience to what you would have with something like Apache Spark, for example.

Okay, so that's getting to the end. I mentioned 0.4, so yes, there are some upgrades to do. When I worked on this previous Ballista release I ended up copying and pasting a bunch of stuff from the Arrow project and DataFusion, because they didn't have the extension points that I needed, and that's been one of the big focuses for 2.0: getting extension points into that project so that I can remove much of the Ballista code. Really, you'd be using DataFusion, and Ballista provides the physical planner that takes care of the distributed execution. Apart from that, joins are really the next big thing, because without joins it's really hard to do lots of real-world problems, and joins are the reason why you need clusters; without joins there's a lot you can do on a single node.

So that's pretty much it. I just wanted to do a shameless plug: I wrote a book a while ago, How Query Engines Work. It's just an introduction, if you're interested in learning more about SQL parsers and query planners and optimization rules and all those kinds of things; check it out, it's on Leanpub. Thanks for listening. There are a few links there; I'll figure out a way to share these slides so we can take a look afterwards. But yeah, thanks, and are there any more questions?

Oh yay, thank you so much. One question: if one wanted to contribute code to this, where is a good place to begin? It's a great question. For the Apache Arrow project, and DataFusion is part of that, issue tracking is on the Apache Jira at issues.apache.org. I'm sure many of you are familiar with Jira; it's not the most exciting thing, but
that's where we do issue tracking for Apache Arrow. There's actually a blog post coming out this week about the release which will have some links in it, but in the meantime, if you go to the Apache Jira, choose the Arrow project, and search for the Rust components, that's a great place to start. We have also been tagging some of the issues with labels, ones that we think are good starters; here we go, there's a beginner label. So there are a number of issues in there, and that could be a pretty good place to get started. I'd also recommend joining the Arrow mailing list; there's a dev mailing list, and that's a great place just to introduce yourself and see what people are looking for help with at any one time. That's lovely, thank you. Also, would you be so kind as to share the URL for your book in the chat where it's clickable, or on Twitter or something? Yeah, I'd be very happy to.

We have a couple of questions in the chat also. Carlos asked: what are the major difficulties or needs currently in the project? That's a great question. Within the Arrow project itself, one of the big challenges we've been facing is that we've been using some nightly features, specifically the specialization feature, in the arrow and parquet crates, which has forced us to stay on nightly Rust, and I think that's really limiting adoption of the project. We did actually have a PR go in just after the 2.0 release that resolved it for the arrow crate, but the parquet crate still has that issue, and because DataFusion depends on parquet, DataFusion also requires nightly. So that's definitely one thing; it is a deeply technical thing that not everybody knows how to resolve, so that's one area where help would be very useful.

Another big area is the whole async topic. We just implemented async in parts of Arrow and DataFusion, and that's working out really well, but the parquet crate doesn't support async. There is a contributor working on that, and I'm sure they would be happy to have some help. The parquet crate wasn't really designed with async in mind, and in DataFusion we essentially had to manage our own threads to do the interactions with Parquet and then use channels to go from async code to the things happening on that thread. It works, but it would be really nice if we could make everything async end to end, and I think we would get better scalability overall.
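The workaround described there, a dedicated thread driving the blocking Parquet reader with a channel feeding the async side, might look roughly like the following. This is a general sketch of the pattern, not DataFusion's actual code, and it assumes a Tokio 1.x-style channel API; the reader function is a stand-in.

```rust
// A general sketch of bridging a blocking reader into async code with a
// dedicated thread and a bounded channel. Assumes tokio 1.x; not
// DataFusion's actual implementation.
use std::thread;
use tokio::sync::mpsc;

/// Stand-in for a blocking, synchronous Parquet-style reader.
fn read_next_batch(batch_index: usize) -> Option<Vec<i32>> {
    if batch_index < 3 {
        Some(vec![batch_index as i32; 4])
    } else {
        None // end of input
    }
}

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel::<Vec<i32>>(2);

    // The blocking reader runs on its own OS thread and pushes batches
    // into the channel; the bounded channel provides back-pressure.
    thread::spawn(move || {
        let mut i = 0;
        while let Some(batch) = read_next_batch(i) {
            if tx.blocking_send(batch).is_err() {
                break; // receiver dropped
            }
            i += 1;
        }
    });

    // The async side consumes batches without blocking the runtime.
    while let Some(batch) = rx.recv().await {
        println!("got a batch of {} values", batch.len());
    }
}
```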
Beyond that, DataFusion is coming along really well, but it has a limited set of operators and expressions, which limits its use in the real world, so a great way to contribute is just to take DataFusion and try running some queries against your own data and see what things are not supported. Maybe there are additional cast operations or different string manipulation functions that need to be built in; those are quite an easy way to get involved, because there's a framework for registering user-defined functions and user-defined aggregate functions, so if it's just additional functionality at that level, I think that's a pretty good place to get involved. Awesome.

Another question: have you considered including tonic as an alternative to Flight for gRPC queries? We actually use tonic. Tonic provides the gRPC side; I'm not sure I even know how to explain this, so let me just bring up the repo, which I conveniently have here. Within Flight, this is the generated code: the Flight protocol is defined in a .proto file, and from that we generate Rust code for the different operations you can do in Flight. Flight has a concept of actions and results and flight descriptors, but tonic doesn't define those; tonic is the server and the async machinery for handling gRPC. In fact, I should just show the code for this. We have an arrow-flight crate, and I think there's an example; okay, here's an example server, so we provide an implementation for this generated code, and then, okay, this is a bad example, but this is using tonic to actually start the service. So that's not a great answer, but tonic is definitely in there. I think we have a better example somewhere; yeah, you can see we use tonic here for the server. Tonic is really the transport, and Flight is the protocol, the shape of the data going over that transport, if that makes sense. Yeah, awesome.

That's all the questions I have in the chat. I have another question: you showed in the architecture diagram that it's possible to connect BI tools through a JDBC driver; does Arrow Flight support access and permission models? That's a good question, and I maybe don't really know the answer. I think it provides mechanisms where you could implement it, but I don't think there's anything specifically for that in the protocol; there's metadata and different actions you can have, but nothing explicitly in there for that as far as I know, though I could be wrong. That might be a good question to post to the mailing list as a follow-up.

So, as far as you know, Arrow Flight is all or nothing, basically, for access to the data? I think it's really down to the implementations. When you send a query, let me just go find the proto file because I think that'll be useful, here we go, Flight.proto: there's a flight service, and there's a handshake that you do when you connect to a service, so there is a concept of a handshake, and that's probably something I haven't looked at, which is why I'm a bit vague on the answer. One thing about Flight is that certain parts are standard and certain parts are pretty opaque: in the handshake there's a payload which is just binary, so it's down to the implementation as to what data is in there or what it means. So yes, people implementing Flight could choose to use this to pass credentials and do authentication, those kinds of things, and okay, I guess there is a basic auth in here, with name and password. So I'm not sure it goes as far as access control, but you could probably do that yourself by how you interact with this, what you put in these payloads. I think that's the most helpful answer I can give. Okay, thanks.
And what communication channels do you use most for coordinating around the Ballista project specifically? For Ballista, let me just jump over there real quick, there's Discourse and Gitter; they're not particularly active, and there are a number of issues in the repo as well. So I'd say those channels, or the issues, are great places just to say hi, talk, ask questions, and see where good places to contribute might be. Awesome.

And you're working at NVIDIA, is that correct? That's correct. Are there any plans to adopt Rust for doing CUDA-like stuff on GPUs directly? Sadly, no. I can't really talk about NVIDIA stuff too much, but there are people there who are very interested in Rust and would definitely love to find ways of adopting it one day; I'm just not aware of anything officially happening there. Cool.

All right, any last questions? All right, well, thank you so much, Andy, for giving this talk; this was really great. Thank you. Yeah, thanks for listening. Thank you so much, this was so much fun. So we do have a speaker lined up for next month; they're getting me the details and choosing what day and time they're going to give the talk, so I'll get that information out on the Meetup page as soon as I have it, which hopefully should be in the next couple of weeks. But until then, this is it. So until next time, stay safe out there, everybody, and happy Rusting. Thanks for hosting, Brooks. Absolutely, thank you, and thank you again for talking; this was a great talk. Thanks, guys. Hey Brooks, I didn't see your, sorry, I don't know why I called you Brooke, I didn't see your private message; did you send it to me on Twitter or did you send it to me here?
Info
Channel: Brooks Builds
Views: 1,351
Rating: 5 out of 5
Id: UnCaJFa13oo
Length: 50min 9sec (3009 seconds)
Published: Sun Oct 25 2020