1 Billion Row Challenge in Rust using Apache Arrow

Captions
Hey folks! In this video we're going to do the One Billion Row Challenge, entirely in Rust. The One Billion Row Challenge was started at the beginning of this year, 2024. It looks at how we can take a data set that is a plain text file containing one billion observations and crunch it down into bite-size results, where we find the minimum, mean, and max value per station. For this I'm going to use the DataFusion library in Rust. DataFusion is a query engine, and it's super powerful; you can learn more about it on your own. So let's dive in.

First things first, I'm going to create a new Rust project. The first thing we do is add DataFusion. With DataFusion added, we also need to add Tokio, and I'm going to request the rt-multi-thread feature. The way I'm going to approach this is with a fully synchronous entry point: instead of annotating main with the #[tokio::main] macro, as people are familiar with, I'm going to use a runtime object directly.

We start by bringing in the DataFusion prelude with use datafusion::prelude::* and creating our main function. Next we create the Tokio runtime: let rt = Runtime::new().unwrap(). Since we enabled the rt-multi-thread feature, this will automatically use all available threads. We immediately have an issue, though: we also need to import the type with use tokio::runtime::Runtime. Perfect.

Now that we've defined our runtime, let's also create the execution context we'll be using for DataFusion: let ctx = SessionContext::new(). Wonderful.

What we'll be reading in is a .txt file: 12.85 GB on disk, semicolon-delimited, without any headers. Having seen this, we can specify the schema we'll use with Arrow: two columns, one called station and the other temperature. We import DataType, Field, and Schema from datafusion::arrow::datatypes and create the fields for our schema. The station field is Field::new("station", DataType::Utf8, false), a UTF-8 string that is not nullable. We do a similar thing for the temperature: Field::new("temperature", DataType::Float32, false), a 32-bit float that is also not nullable. From these we can build our schema, which takes a vector of fields: Schema::new(vec![station_field, temp_field]). Lovely.

From this we also create our CSV reader options. We'll read the file with the CSV reading function, but this isn't actually a CSV; it's a semicolon-delimited file, so we have to make some adjustments: let opts = CsvReadOptions::new(), then specify the delimiter to be the semicolon character, say that it has no header, and set the file extension to "txt". Perfect.

Now, using our runtime, we read in the file: let df = rt.block_on(ctx.read_csv(path, opts)). I haven't actually specified the path to the big data set, so I'm just going to copy in some nasty path. If we didn't use this block_on function, all we'd have is a future, and nothing would actually be executed without awaiting the result. But since we're not in an async function, we can't call .await, and that's why we use the runtime object. A bonus of using the runtime object this way: if we made this into a real function, we could provide it to other language libraries using something like extendr for R or PyO3 for Python. I'm going to unwrap the result, because I feel confident, and also lazy enough that I don't want to do any error checking.
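Here's a minimal sketch of the setup described so far. It assumes a DataFusion release from around the time of the video (early 2024), and the file name measurements.txt stands in for the path used in the video:

    // Dependencies were added with `cargo add datafusion` and
    // `cargo add tokio --features rt-multi-thread`.
    use datafusion::arrow::datatypes::{DataType, Field, Schema};
    use datafusion::prelude::*;
    use tokio::runtime::Runtime;

    fn main() {
        // With the rt-multi-thread feature enabled, this runtime
        // uses all available threads.
        let rt = Runtime::new().unwrap();

        // DataFusion's execution context.
        let ctx = SessionContext::new();

        // Two non-nullable columns: the station name and its temperature.
        let station_field = Field::new("station", DataType::Utf8, false);
        let temp_field = Field::new("temperature", DataType::Float32, false);
        let schema = Schema::new(vec![station_field, temp_field]);

        // The input is semicolon-delimited, has no header row,
        // and uses a .txt extension rather than .csv.
        let opts = CsvReadOptions::new()
            .delimiter(b';')
            .has_header(false)
            .file_extension("txt")
            .schema(&schema);

        // block_on drives the async read to completion without an async main.
        // "measurements.txt" is a placeholder path.
        let df = rt.block_on(ctx.read_csv("measurements.txt", opts)).unwrap();

        // ... the query plan continues in the next sketch ...
    }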
Now for the important part: creating our query plan. We write let results_future = df.aggregate(...). The first argument is the grouping expression, so we group by the station column. The aggregation step is a little longer, since we have a few things to compute: first the minimum of the temperature column, given the alias min_temp; then the average of the same column, aliased mean_temp; and lastly the max of the same column, aliased max_temp. If we put a semicolon here, we see we get a Result, so we unwrap it.

The other important part of the One Billion Row Challenge is that we also need to sort by the first column. Looking back at the file, it's all out of order, with multiple observations for each location; what we need instead is each location in order by name. So we use the sort method, which takes an expression: the column we're sorting by, station, with a sort specification attached. That specification asks whether we sort ascending or descending, and whether nulls go at the beginning or the end. We sort ascending (true) and put nulls at the bottom (false), which shouldn't matter anyway because there should be no nulls. We unwrap this and call collect on the result.

Here we have a future again, so nothing is actually going to happen until we say "run this to completion", and that's where the runtime object comes in handy: let results = rt.block_on(results_future).

Now the fun part: making it look pretty. We use the pretty module from datafusion::arrow::util and call pretty_format_batches, providing a reference to our results. The compiler says we didn't use our schema, oopsies, so let's attach the schema to our reader options with .schema(&schema). Okay, perfect: no more warnings from the compiler, which means it should work. Let's do this: cargo run --release. Hell yeah, look at that! I've run this a few times myself, and I know it takes anywhere between 26 and 28 seconds using this pretty simple DataFusion approach, which is super cool.
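Continuing inside the same main from the sketch above, here are the aggregate, sort, collect, and pretty-printing steps just described. The alias names follow the video, and the sketch assumes min, avg, and max are exported by the DataFusion prelude, as they were in early-2024 releases:

        // Group by station, then compute min, mean, and max temperature.
        let results_future = df
            .aggregate(
                vec![col("station")],
                vec![
                    min(col("temperature")).alias("min_temp"),
                    avg(col("temperature")).alias("mean_temp"),
                    max(col("temperature")).alias("max_temp"),
                ],
            )
            .unwrap()
            // Ascending by station, nulls last (there should be none anyway).
            .sort(vec![col("station").sort(true, false)])
            .unwrap()
            // collect() returns a future; nothing runs until it is driven.
            .collect();

        // Run the plan to completion on the Tokio runtime.
        let results = rt.block_on(results_future).unwrap();

        // Render the record batches as a text table.
        let table = datafusion::arrow::util::pretty::pretty_format_batches(&results).unwrap();
        println!("{table}");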
With just 44 lines of code, we've used DataFusion and Apache Arrow to process a billion records in just under 30 seconds, and you know what, that's pretty fast. I'm pretty stoked with that. I hope you enjoyed this!
Info
Channel: Josiah Parry
Views: 6,900
Keywords: rust, arrow, programming, data engineering, Rstats, python, datafusion, apache arrow
Id: Bc55FBwuJLA
Length: 9min 11sec (551 seconds)
Published: Sat Apr 06 2024