Rust Vienna Jan 2024 - Serverless Data Pipelines in Rust by Michele Vigilante

Captions
All right, well, hello guys. My name is Michele Vigilante, and this is a talk about serverless data pipelines in Rust.

A bit about me: I am a data engineer at a company called Rency. I previously wrote automation software in C/C++, and I spent most of my time as a backend engineer at a company called Firstbird, where I programmed a lot in Scala. The natural evolution of low-level programming plus a strong type system was Rust, which is why I moved into data engineering with mostly Scala and Rust. I'm a pretty simple guy: I like coffee, video games, and cats. Those are my two cats, Vow and Muggi.

First question first: who here knows something about data engineering, or has an idea of what data engineering is? Okay, that's more than I expected.

So what is a data pipeline? As the great philosopher ChatGPT once said, a data pipeline is a set of data processing steps that transfers data from one system to another, transforming and consolidating it along the way. And that's generally it: you move data from A to B, apply some kind of transforming operation, and usually also reduce the data set. This is where the term MapReduce comes from, which is used a lot in data engineering.

Usually that takes the shape of an ETL pipeline: extract, transform, and load. You have some kind of data source from which you ingest data, via events, files, a database, or something like that. The ETL pipeline then extracts that data, transforms it (applying whatever transformation you need, maybe converting some types), and in the end loads it into a data warehouse or a data lake. And that's basically it — talk over? No. (A minimal, schematic sketch of those three steps is shown below.)
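To make the extract-transform-load shape concrete, here is a minimal, hypothetical Rust skeleton of those three steps. All names and types are made up for illustration; they do not come from the talk or from any particular library.

    // Hypothetical ETL skeleton; Record, extract, transform and load are illustrative only.
    struct Record {
        id: u64,
        value: f64,
    }

    // Extract: ingest raw data from some source (events, files, a database, ...).
    fn extract() -> Vec<Record> {
        vec![
            Record { id: 1, value: 10.0 },
            Record { id: 2, value: -3.0 },
        ]
    }

    // Transform: apply whatever conversion or filtering the pipeline needs.
    fn transform(records: Vec<Record>) -> Vec<Record> {
        records.into_iter().filter(|r| r.value >= 0.0).collect()
    }

    // Load: write the result to a warehouse or data lake; stdout stands in here.
    fn load(records: &[Record]) {
        for r in records {
            println!("{} -> {}", r.id, r.value);
        }
    }

    fn main() {
        let raw = extract();
        let clean = transform(raw);
        load(&clean);
    }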
Generally, what's difficult about data engineering is mostly the size of the data you're processing. There is this term floating around all the time called "big data", and it usually refers to very, very large data sets. A lot of people misappropriate the term, because usually their data sets are not that big, and usually you don't need tools like the following.

This is roughly the data engineering landscape nowadays: a lot of Python, a lot of Java and Scala, a lot of SQL. You have all of the big frameworks, most notably something like Spark (Hadoop was kind of the first attempt), and then streaming data pipeline tools like Flink. Notably missing here is Rust. Rust is very new to the data engineering world, but there is a big effort to do more with Rust in data engineering.

As a data engineer you usually serve a data analytics team, or some kind of AI or machine learning team. The reason there is so much Python in data engineering is that it actually comes from the data analytics sector, where people needed to analyze large batches of data. Data analysts are usually not engineers, so when they build very complex pipelines it becomes unsustainable at some point, and that's why data engineering forked off into its own discipline that does most of the data handling, while analytics then analyzes that data.

So why Rust? Rust is a great fit because it is very performant and scales very well, it gives you a lot of safety guarantees, and my personal favorite is the error handling — I love error handling in Rust, it's awesome. There is very good tooling, a great ecosystem, a good community, and it is very low cost in operation because it has a very small footprint in the end, which, as was mentioned, is also why you can deploy this kind of thing in a serverless environment. And last but not least, I just want to use Rust. I like Rust, I've been evangelizing Rust for a long time, and I always try to get my company to use more of it. So yes, this is of course an opinionated talk, and these are just my opinions — just as a disclaimer.

So let's go to DataFusion directly. DataFusion is what enabled me personally to think about using Rust for data engineering. It is a query engine, or a data processing framework, however you want to see it, and it was authored by Andy Grove. There's a pretty funny story behind that: Andy Grove was reading a book about query engines and thought it would be interesting to implement one, so he started with parsing SQL and basically tried to write his own query engine based on that book. He initially started out in Java, then thought Rust seemed kind of cool and switched to it, and a lot of community members started helping him and pointing out useful building blocks, most notably Apache Arrow. Today Andy Grove is very senior in this area — he is the PMC chair (project management committee chair) of the Apache Arrow project, which is quite large — and he currently works at NVIDIA on query engines that use GPUs; a lot of his work goes into the open-source RAPIDS accelerator for Spark SQL. Ultimately he donated what he had initially come up with to the Apache Arrow project, and there has been a lot of community interest around it since.

This wouldn't be a Rust talk without showing some benchmarks. I cannot take credit for these; this is just a blog post I found. It was quite difficult to find a benchmark that compares the different kinds of solutions, because DataFusion being purely a query engine makes it hard to compare with other things. This one is based on a data set of about five and a half million records, around 27 megabytes, and the author — the blog is called Confessions of a Data Guy — gave it a spin on this data set and tried to figure out how fast it can go. The gist of it: it is very, very fast.

I already mentioned Apache Arrow, and it is fundamental to why DataFusion is as viable as it is. It was released in 2016 and it is a columnar in-memory data format — basically, how you represent data in memory. That matters because traditionally most data was represented row-based, and with row-based data you have all kinds of different types next to each other in memory, which usually doesn't result in a very good memory layout. Storing it in a columnar fashion, where all values of a column have the same type, is usually a lot faster because you can traverse it much more efficiently. Arrow supports all kinds of types, from very primitive ones to complex nested ones like structs, maps, and lists. (A tiny example of building such a columnar batch is sketched below.)
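As a small illustration of what "columnar in memory" means in practice, here is a minimal sketch using the arrow crate, the Rust implementation of Apache Arrow that DataFusion builds on (and re-exports). It assumes arrow is a dependency; exact module paths can shift a little between arrow versions.

    use std::sync::Arc;

    use arrow::array::{ArrayRef, Float64Array, StringArray};
    use arrow::datatypes::{DataType, Field, Schema};
    use arrow::record_batch::RecordBatch;

    fn main() -> Result<(), arrow::error::ArrowError> {
        // Each column lives in its own contiguous, typed buffer...
        let city = StringArray::from(vec!["Vienna", "New York", "Vienna"]);
        let fare = Float64Array::from(vec![12.5, 29.0, 9.9]);

        // ...and a RecordBatch ties several columns together under one schema.
        let schema = Schema::new(vec![
            Field::new("city", DataType::Utf8, false),
            Field::new("fare", DataType::Float64, false),
        ]);
        let columns: Vec<ArrayRef> = vec![Arc::new(city), Arc::new(fare)];
        let batch = RecordBatch::try_new(Arc::new(schema), columns)?;

        println!("{} rows x {} columns", batch.num_rows(), batch.num_columns());
        Ok(())
    }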
Arrow is widely used in the data engineering ecosystem: it is used by tools like Polars and pandas, as well as DataFusion, and there are other things that use it too. What's really cool about it is its IPC protocol for inter-process communication, which lets you send pieces of that memory between two different frameworks. That enables a lot of interop — all of these frameworks are generally interoperable, so you can take a data frame and move it from one context into another, which is very powerful.

Pandas actually started using this fairly recently. Do you know what pandas is? You have a rough idea, yeah. It's essentially a data frame library with which you can read in data, query it, and do the whole ETL pipeline; a lot of people use pandas for that, but it was notorious for being extremely slow. Yes, it is Python, although less so nowadays, especially since version 2.0 — I think they are actually using the Rust implementation of Arrow under the hood. There is also a lot of overlap between DataFusion and Polars. Polars was kind of the fast alternative to pandas that came out a couple of years ago, and people started switching to it; the two projects overlap a lot and even use some of the same crates. So yeah, it's pretty cool. One thing I personally prefer about DataFusion, especially when coding in Rust, is the API: both DataFusion and Polars are viable for writing data pipelines in Rust, but the Rust API for Polars is not great, I would say. I found it difficult and inconvenient to use, because the API mostly exists as a layer for the Python bindings, and that's very noticeable when you work with it.

Another really cool feature around DataFusion is a second crate that got spawned out of it, called object_store. Object Store is essentially an abstraction over different types of object stores — think of something like AWS S3 or Google Cloud Storage. It's just the notion that there is some kind of file system with some files in it, and it gives you a generic API over all of these systems. It is natively implemented for AWS S3, Google Cloud Storage, Azure, and similar things, and also for the local file system, which is interesting for testing. (A small usage sketch follows below.)
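Here is a rough sketch of what the object_store abstraction looks like in use, with the local filesystem backend standing in for S3/GCS/Azure — which is exactly what makes it handy for tests. It assumes the object_store, bytes, and tokio crates as dependencies; some details (for example how put takes its payload) have changed between object_store versions, so treat this as a sketch rather than exact signatures.

    use object_store::local::LocalFileSystem;
    use object_store::path::Path;
    use object_store::ObjectStore;

    #[tokio::main]
    async fn main() -> Result<(), Box<dyn std::error::Error>> {
        // The same ObjectStore trait is implemented for AWS S3, GCS, Azure and the local FS.
        std::fs::create_dir_all("/tmp/demo-store")?;
        let store = LocalFileSystem::new_with_prefix("/tmp/demo-store")?;

        // Write an object under a key...
        let location = Path::from("trips/2022-01.csv");
        let payload = bytes::Bytes::from_static(b"a,b\n1,2\n");
        store.put(&location, payload.into()).await?;

        // ...and read it back.
        let data = store.get(&location).await?.bytes().await?;
        println!("read {} bytes", data.len());
        Ok(())
    }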
To explain what a query engine is, I like to compare it to a database management system. Any DBMS has a huge list of features it incorporates: the storage part, the catalog, the query engine, user management, permissions, clustering, multi-tenancy, whatever — there is a huge amount of implementation in any DBMS, and they are very heavyweight tools. For a lot of cases in data engineering you don't need to go that far (you can, and a lot of people do, but you don't need to). DataFusion sits essentially just in the query engine part — I'll go deeper into that a bit later — and object_store would be roughly the storage layer.

So when do you use a tool like DataFusion? There is a term that has been coined for this: "medium data". As I said before, a lot of the time when people say big data they don't actually mean big data; they mean something more like this — something that is too big for Excel and too small for something like Spark. It is mainly useful for slow-moving data sets, anything that doesn't need sub-millisecond or even just millisecond response times: mostly analytical workloads, or anything that is not real time, for example analytics or machine learning. Someone asked whether "anything that makes your Excel crash" counts: I don't know exactly, but I would think it starts to get iffy at around 50 megabytes. Some companies do reach that limit, and that's the point where a lot of data analysts start using tools like pandas and try to pre-process — essentially map-reduce — the data in some way.

All right, so this is a short example of how to use DataFusion. There are two APIs: a DataFrame API and a SQL API. For people familiar with Spark, this will look familiar: there is always a session context, you always run inside some kind of context that you set up beforehand, and that's the first thing you do. Then you create your DataFrame, which you can imagine simply as a table in memory — here we just load some example CSV. You then build your execution plan; think of it, in DBMS terms, as a query plan. In this example we do a filter operation, keeping any row where column A is less than or equal to column B; then we aggregate, where the first expression is the GROUP BY clause (we group by A) and the second is the aggregate function, in this case the min — the smallest value — of column B; then there is a limit; and in the end you execute the plan by calling the DataFrame's collect function. (A compact sketch of this flow is included after this passage.) The result of collect is a record batch, one of the Arrow primitives: if you picture columns in Arrow, a batch of multiple columns together is what a record batch is. DataFusion exposes all of the Arrow primitive types, and you can dig down and extend DataFusion by working with those primitives — user-defined functions, user-defined aggregate functions, all kinds of stuff. It does tend to get a bit more complex there and you need to read a fair amount of documentation, but it is fairly well documented, and all of these projects generally have very good examples; a lot of what I show is based on their examples.

Someone asked whether the row-based view is only a behind-the-scenes thing — whether you can still think in terms of rows. Yes: you can think of it in row-based terms, but in the background it is all column-based. That matters more for the execution than for what you implement. And yes, exactly — a consumer that is only interested in certain columns doesn't have to touch the others, which is a very good point: this is where a lot of the optimizations of the execution plan happen, because you can leave out a lot of data.
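For reference, a compact version of that DataFrame flow looks roughly like this. It is modeled on DataFusion's own documented examples and assumes datafusion and tokio as dependencies; the file path is a placeholder, and the exact home of helpers such as min has moved between DataFusion versions, so treat the imports as approximate.

    use datafusion::error::Result;
    use datafusion::prelude::*;

    #[tokio::main]
    async fn main() -> Result<()> {
        // The session context is the entry point, much like a Spark session.
        let ctx = SessionContext::new();

        // Create a DataFrame from a CSV file (placeholder path).
        let df = ctx.read_csv("tests/data/example.csv", CsvReadOptions::new()).await?;

        // Build the plan: filter a <= b, group by a, take min(b), limit to 100 rows.
        let df = df
            .filter(col("a").lt_eq(col("b")))?
            .aggregate(vec![col("a")], vec![min(col("b"))])?
            .limit(0, Some(100))?;

        // Execute the plan; the result is a set of Arrow record batches.
        let batches = df.collect().await?;
        datafusion::arrow::util::pretty::print_batches(&batches)?;
        Ok(())
    }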
Here is the same example in SQL. Essentially you now work on the context instead, and the SQL gets translated into an execution plan. (A compact version is sketched at the end of this passage.) How does that happen? There is a query parser in DataFusion that builds the AST of the query and from that builds the logical plan. The logical plan is only the first step of the whole query plan — the DataFrame API goes directly to the logical plan level. Then DataFusion applies optimization rules: for example leaving out columns, or, if you're using a format such as Parquet, push-down predicates, where the filter is executed directly against the Parquet file instead of loading everything into memory, because Parquet supports that — and other optimizations of that kind. Then there is the physical plan, which works on a different level, directly together with the execution. So the logical plan optimizes based on the query semantics, more or less, and the physical plan goes one level deeper and optimizes at the low level; in the end that gets executed.

For data sources it's fairly simple: if you want to plug any type of data source into DataFusion, it just needs to implement the TableProvider trait that is part of DataFusion, and you can make your own table provider if you wish, to load data from whatever you like. This is the level at which object_store, for example, is integrated — it just provides you with a table provider for a given object store.
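The SQL flavour of the same query looks roughly like this, again modeled on DataFusion's documented examples; the table name and file path are placeholders.

    use datafusion::error::Result;
    use datafusion::prelude::*;

    #[tokio::main]
    async fn main() -> Result<()> {
        let ctx = SessionContext::new();

        // Register the CSV under a table name so SQL can refer to it.
        ctx.register_csv("example", "tests/data/example.csv", CsvReadOptions::new())
            .await?;

        // The SQL string is parsed, planned, optimized and executed --
        // the same parser -> logical plan -> physical plan pipeline described above.
        let df = ctx
            .sql("SELECT a, MIN(b) FROM example WHERE a <= b GROUP BY a LIMIT 100")
            .await?;

        df.show().await?;
        Ok(())
    }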
So let's get practical. Story time: you're working for a company that analyzes city traffic, you're a data engineer, and you receive a request to help the analytics team look into the New York taxi trip record data. This is a very widely used data set in all kinds of data engineering examples. It's pretty cool, actually: the city of New York participates in an open data program where they publish a lot of the data they have, so all of it is public and you can use it, which is why it's great for showcases and examples. The tasks you're assigned: serve the analytics team with some data — average trip duration, average trip cost, and other metrics they would be interested in; also location-based data, like where people are picked up and where they are dropped off inside New York; process all of the data from 2014 to 2023; and aggregate all of it by month.

Task number one is to actually get the data. It is hosted on the New York City .gov site, it's stored in Parquet, and it has a predictable URL format. The data set contains a couple of things: the yellow cab trip records, the green taxi trip records, and others — there is even some Uber data, but only very little. For showcase purposes I just picked the yellow cab data; it's the easiest. What's also really nice is that the data dictionaries are hosted there too, so if you want to know what all of the data means, you can look into one of them — for example the yellow trips dictionary. These are the columns you can expect: a vendor ID, the pickup time, the passenger count, the trip distance, location IDs, how much the person paid and how that fare is split up — all kinds of interesting data. Because we're in data engineering land, some Python is okay, I would say: to download this it's just a Python script that walks through all of the URLs we're interested in and stores the files somewhere. So that's the first step — we now have our Parquet data.

But there is a problem: the data set is quite large, about 10 gigabytes, and we need to do some pre-aggregation, because the analytics team's Excel is crying. The data also needs to be interpreted — it's not super transparent what each thing means, and it's usually left to the data engineers to figure out what the domain actually is. Then you get to task number two: build a pipeline for this. You know Rust, you have limited resources, the whole thing needs to be reliable, and it needs to be compatible with the analytics tooling. So you sit down and start architecting.

What can we do? On one side there is the New York City website, and we want to download all of the data into an S3 bucket — a suitable object store. S3 buckets can publish messages to SNS when a file gets uploaded, which is great, because then we can just subscribe a Lambda with an SQS queue to consume that message: it triggers whenever a new file gets uploaded, we process that file, and we store the output back into S3. A Lambda is a great candidate for this because each file is not going to be massive — we're not talking about minutes of processing time, it will take a lot less than that — and you only need it when a file is uploaded. Another great thing about Lambda is that it scales: you can have multiple Lambdas running in parallel, so if someone does a huge data dump and drops a bunch of files in there, you can process them in parallel, and you get all of that for free because it's part of Lambda. The analytics team can then simply query the output that you have on S3. That would be one example of how you could architect this.

All right, let's get to writing some code. First we have some kind of input date — in this case the 1st of January 2022 — and we give it the path it should load from; this is just a local path, since all of this was implemented for local use for now. Then you create your session context and load the DataFrame based on the input path. Next you transform the data: you essentially just select the columns you're interested in — here the pickup and drop-off time, the locations, the distance, the passenger count, the total amount of the trip, and the tip amount — and at the end we compute the duration of the ride from the pickup and drop-off times. The next step is to aggregate: we said we want averages for most of these things, so we aggregate and group by year and month (that was the requirement), we count how many trips happened, we take the average over the columns we're interested in, and at the end we sort by month. And that's essentially it. (A rough sketch of this aggregation is shown below.)
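A rough sketch of that aggregation, expressed through the SQL API, might look like the following. The column names follow the yellow-taxi data dictionary, but the file path is a placeholder and this is only an approximation of the real project (which, for example, also derives the trip duration from the two timestamps and computes per-location pickup/drop-off counts); it assumes datafusion and tokio as dependencies.

    use datafusion::error::Result;
    use datafusion::prelude::*;

    #[tokio::main]
    async fn main() -> Result<()> {
        let ctx = SessionContext::new();

        // Placeholder path; the real pipeline loads the month it was triggered for.
        ctx.register_parquet(
            "trips",
            "data/yellow_tripdata_2022-01.parquet",
            ParquetReadOptions::default(),
        )
        .await?;

        // Count trips and average the interesting columns, grouped by year and month.
        // (The real code also derives a trip duration from pickup/drop-off times.)
        let df = ctx
            .sql(
                "SELECT
                     date_part('year', tpep_pickup_datetime)  AS trip_year,
                     date_part('month', tpep_pickup_datetime) AS trip_month,
                     count(*)            AS trips,
                     avg(total_amount)   AS avg_total_amount,
                     avg(tip_amount)     AS avg_tip_amount,
                     avg(trip_distance)  AS avg_trip_distance
                 FROM trips
                 GROUP BY date_part('year', tpep_pickup_datetime),
                          date_part('month', tpep_pickup_datetime)
                 ORDER BY trip_year, trip_month",
            )
            .await?;

        df.show().await?;
        Ok(())
    }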
That's what we need to aggregate the data. The last step of the ETL pipeline — we had extract, we had transform — is to load the data somewhere, and in this case it's just some output directory; in the real setup that would be the S3 bucket.

So let me show you some code. This project — sorry, that's the wrong one — this project is on GitHub; I implemented it for fun, so it actually has most of everything in there. There are two binaries: a plain one, just for easy local execution, and a Lambda main entry point for the Lambda itself. (Is this big enough, does everyone see? Yeah, okay.) The project uses a neat tool called cargo-lambda, which is great for bootstrapping your Lambda development — it's very fast to get into. So let's take a look at how that actually works. This is just a normal Rust main, and the Lambda runtime here is actually from AWS: they maintain a lambda_runtime crate, which is just a wrapper around this handler function that gets called when the Lambda is invoked, and you get the event passed in as the argument. In this case I'm simply using a string as the input: I parse that string and then process the month I'm given — so the input for this is just the month I'm interested in. The rest is mostly the code you've seen before; it's a bit more verbose, but ultimately it's the same thing. (A stripped-down sketch of the handler pattern follows below.)
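The handler wiring with AWS's lambda_runtime crate looks roughly like this. It is a stripped-down sketch of the pattern, not the actual project code: the payload is just a JSON string naming the month, and process_month is a hypothetical stand-in for the DataFusion pipeline shown earlier. It assumes lambda_runtime, serde_json, and tokio as dependencies.

    use lambda_runtime::{run, service_fn, Error, LambdaEvent};
    use serde_json::{json, Value};

    // Hypothetical stand-in for the real pipeline: read, transform, aggregate, write.
    async fn process_month(month: &str) -> Result<(), Error> {
        println!("processing {month}");
        Ok(())
    }

    // Called once per invocation; the payload here is a JSON string like "2022-01".
    async fn handler(event: LambdaEvent<String>) -> Result<Value, Error> {
        let month = event.payload;
        process_month(&month).await?;
        Ok(json!({ "status": "ok", "month": month }))
    }

    #[tokio::main]
    async fn main() -> Result<(), Error> {
        // lambda_runtime drives the handler whenever the Lambda is invoked.
        run(service_fn(handler)).await
    }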
Let me show you how the initial data set looks. It looks like this — it's in Parquet format, and funny enough I'm using datafusion-cli here. datafusion-cli is a project by DataFusion, and it's mostly a showcase of how easy it is to build something with DataFusion — the whole CLI application is only a couple of hundred lines of code, it's not very extensive — but I actually use it a lot day to day for these kinds of trivial queries; you can query CSVs, Parquet, whatever.

Oh — the Parquet data format, I should have explained it, sorry. Parquet is a binary data format for columnar data. What makes it very good is the very small at-rest size — the file size when you're not doing anything with it is very, very small — because it stores all of the data in a columnar fashion and then uses techniques such as deduplication and compression, so you can really reduce the size of the file compared to something like CSV. A simple example: a 100-megabyte CSV file could go down to something like 20 megabytes in Parquet. And yes, to the question about the file itself: from what I understand, Parquet has a kind of header that tells you what the schema of the file is and how to interpret the data, but that's very much an implementation detail and I'm not super familiar with it. It is a great format because in the end it reduces the size significantly. (A tiny CSV-to-Parquet conversion sketch is included at the end of this demo section.) For example, at our company we process a lot of clicks — we also ingest a lot of clicks from third parties, so there are a lot of clicks we track — and the size of all clicks for 2023, and that's millions and millions of clicks, boils down in Parquet to something like six or seven gigabytes, which is very, very small; if you had this in a DBMS you'd be talking about hundreds of gigabytes of storage.

I'm not going to show too much more of the code — you can check it out on GitHub if you want — but I am going to demonstrate it running. Something cool that cargo-lambda gives you is cargo lambda watch (let's do the release build), which watches your code changes and recompiles when needed, and then there is a cargo lambda invoke command to actually run the Lambda. I hope it doesn't need to compile too much... okay, good. So I'm running it for the 1st of January 2022 — and that's it. Let me check real quick how large this data set is: 36 megabytes, so a fairly sizable Parquet file — that's pretty big, actually — and it took basically no time to run. I'm printing all of the states of the data as it goes through the pipeline: the raw input format, then the data set we selected, then the trip stats in aggregated form, and also the location stats in aggregated form — something I didn't walk through, but essentially we aggregate per location ID how many pickups and how many drop-offs happened. So we did it — that's our pipeline.

But let me take off my data engineering hat for a second and put on my data analytics hat, because I want to actually see something — as a data engineer you don't get to see a lot beyond just some data. So, what might be interesting to show: I have some plots that are, honestly, a lot of ChatGPT-generated Python code — I didn't look too deeply into it — but I got some interesting results. This is all of the trip statistics mapped over time, and what was interesting to me is that the average price in 2023 went up to about $29, whereas in 2021 it was around 19 — a huge increase. Disregard the spikes; those are just errors in the data, or some extreme values that skew the average. That's something the analytics team might like to plot. Then there is a second plot I'm pretty proud of, because it's very pretty: a geographic plot of New York City (it takes some time — sorry, it's Python... joking). Yeah, true — I blame ChatGPT, of course. But it works, and it's production-ready, absolutely — and, not even joking, I'm sure a lot of companies do exactly that. So this is a heat map of New York City: on the left side you see all of the pickups, on the right side all of the drop-offs. Anyone have a guess what the hotspots are — say this area, this area, this area, and this one? Airports, yes, very good guess: this is the airport — no, no, that's JFK; LaGuardia is up here — and this is of course Manhattan. What's maybe interesting to see in this data set is that the drop-offs are spread out much more, because people are of course trying to get home, whereas the pickups are concentrated in very specific areas. So we actually did it.
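As mentioned above, here is a tiny CSV-to-Parquet conversion sketch with DataFusion, to illustrate the compression point. The paths are placeholders, and the write_parquet signature has changed a few times across DataFusion releases, so treat the exact arguments as approximate.

    use datafusion::dataframe::DataFrameWriteOptions;
    use datafusion::error::Result;
    use datafusion::prelude::*;

    #[tokio::main]
    async fn main() -> Result<()> {
        let ctx = SessionContext::new();

        // Read a (hypothetical) CSV and write it back out as Parquet; the columnar
        // encoding plus compression is what shrinks the at-rest size so dramatically.
        let df = ctx.read_csv("data/clicks.csv", CsvReadOptions::new()).await?;
        df.write_parquet("data/clicks_parquet/", DataFrameWriteOptions::new(), None)
            .await?;
        Ok(())
    }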
So now let's talk about how you might want to scale this out, because I believe this kind of pattern is applicable to much larger data sets than you would think — you can process very large data sets with this pattern just by splitting the work up. This is actually something we did in production: instead of a single Lambda doing all of the work, you use two Lambdas, where one does the partitioning and the second one does the aggregating. This is generally always true in data engineering: if your data set is too large you need to partition it, and the better you partition, the more efficiently you can process it. Keep in mind that AWS Lambda goes up to 10 gigabytes of memory — you can actually have 10 gigs of RAM on a Lambda — so this suits a lot of cases.

One important detail: if you want to do partitioning and keep everything event-triggered — notice there is no orchestration tool here, no Airflow or anything like that — then you also need to be aware of that partitioning on the trigger side, and this is a great use case for FIFO queues. With first-in-first-out queues on AWS you can use the partition key of your data model as the message group key on the FIFO topic or queue. (A rough sketch of publishing with a message group ID is shown at the end of this section.) That stops multiple Lambdas from triggering for the same partition; instead, messages for a partition are handled in first-in, first-out fashion. Then you output the result into S3, which is the same as before. It's a pretty cool pattern — you can stagger it further, split it up more, chain more Lambdas if you really want to — and it scales up to quite large data sets.

Question from the audience: something that bugs me because I cannot solve it — when we have really huge data, say more than 10 gigs, what do we do if we want the median of the data set? The whole data set has to be sorted and then you take the middle element; how would you do that? Well, if you really need it over the entire data set, you don't have many options: you'll need something like Spark, and there will be a lot of inefficient data shuffling going on in Spark, if that is truly what you need. But in most cases that's not actually true — usually there is some partitioning logic you can apply to reduce the data set size. Also, 10 gigs is not always 10 gigs: if you're using something like Parquet, a lot of optimization happens even before the data is loaded into memory, so 10 gigs of memory can account for much larger files. You could even have a 100-gig file, and if your logic optimizes to the point where you can break that file down and only load 10 gigs into memory, that will work too — for example by only reading specific columns. And remember that Parquet supports push-down predicates, so you can filter the data even before loading it — not only by column, but on the values themselves — you can filter on the value itself before it ever gets loaded into memory.
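To illustrate the FIFO idea, here is a rough sketch of publishing work items with the aws-sdk-sqs crate, using the data partition key (for example the month) as the message group ID so that messages for the same partition are delivered in order and not picked up by several Lambdas at once. The queue URL and payload are placeholders, and this only sketches the idea (the talk's setup puts an SNS FIFO topic in front of the queue); it assumes aws-config, aws-sdk-sqs, and tokio as dependencies.

    use aws_sdk_sqs::Client;

    #[tokio::main]
    async fn main() -> Result<(), aws_sdk_sqs::Error> {
        let config = aws_config::load_from_env().await;
        let client = Client::new(&config);

        // Hypothetical FIFO queue and partition key (here: the month being processed).
        let queue_url = "https://sqs.eu-central-1.amazonaws.com/123456789012/trips.fifo";
        let partition_key = "2022-01";

        client
            .send_message()
            .queue_url(queue_url)
            .message_body(format!("{{\"month\":\"{partition_key}\"}}"))
            // Messages sharing a group id are delivered in order, one at a time,
            // which prevents two Lambdas from working on the same partition.
            .message_group_id(partition_key)
            .message_deduplication_id(partition_key)
            .send()
            .await?;

        Ok(())
    }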
I'm a bit over time, so I'll just mention this: Delta Lake is a very cool thing — it could be a whole talk on its own — but essentially it's another tool in this ecosystem with very good integration with DataFusion, and it lets you turn your object store, like S3, into an ACID-compliant database, basically, which is really cool. You can do merge operations, update records, insert records, and it's all based on a transaction log system, so it's immutable. A pretty cool tool, but maybe for next time.

And that's basically it. I have some links here for some benchmarks and for some talks by Andrew Lamb, who is also an Apache Arrow PMC member — he has mostly taken over the project — and who comes from a company called InfluxData; their newest product, InfluxDB IOx, is fully Rust-based and uses DataFusion. There is also a second talk by Jorge that goes into the details of Apache Arrow at a lower level. Both are very good talks. And that's it — thanks guys very much for listening.
Info
Channel: Rust
Views: 4,397
Keywords: rust-lang, rust, rustlang
Id: PK_FKzgPDWg
Length: 41min 2sec (2462 seconds)
Published: Wed Feb 28 2024