RustConf 2021 - Whoops! I Rewrote It in Rust by Brian Martin

Captions
Hello, and welcome to "Whoops! I Rewrote It in Rust." My name is Brian Martin, my pronouns are he/him, and I'm a software engineer at Twitter. I've been at Twitter for seven years, and I've been using Rust for six. I've worked on a few open source projects in Rust, including a benchmarking tool, a systems performance telemetry agent, and now a cache server. When I'm not at a computer, I volunteer with my local search and rescue team, and I'm training my dog Riker to find people who get lost in the wilderness.

This talk is framed around caching, specifically distributed key-value storage. You can think of it as the standard library's HashMap, but over the network. Cache services are very high throughput and low latency: a single core can do around a hundred thousand requests per second with sub-millisecond latencies. Caches store frequently accessed items so that we don't need to ask the database for the same thing again and again. This protects the database from bursts of traffic, and it lets the application get data more quickly, since the cache is so fast. There are a couple of common open source solutions in this space, including memcached and Redis, which are both written in C.

That leads me to Pelikan, which is Twitter's caching framework that we want to use to replace our forks of memcached and Redis. Pelikan gives us a single code base to produce multiple caching solutions and to easily experiment with solutions that leverage new technologies. We needed to add TLS support to Pelikan so that we could encrypt traffic between the client application and the cache server. I convinced the project maintainer, who is my current manager, that we could probably use Rust to add TLS support to Pelikan. I had the time to work on it, and I definitely feel a lot more comfortable using Rust than C. We wanted to be sure that if we went this way, we would be able to match the performance of the C implementation.

There were some previous efforts that made me think this was going to be possible. Back in 2018, an engineer wanted to add a new storage library to Pelikan and decided to try writing it in Rust. The idea was to use the core of Pelikan and provide an FFI wrapper around the Rust storage library. This worked out pretty well: development was quick, and having a hybrid C-and-Rust server was workable. It became the first commit of Rust into Pelikan. Super exciting. Then in 2019, another engineer was a bit more ambitious and wanted to try using Rust for the actual server code. They decided to use Tokio to write the async server runtime, but to leverage the C components from the normal cache server for storage and some of the other foundational libraries. This was pretty neat: it showed we could do much more advanced layering of C and Rust together in the code base, and I was excited to see that Rust could in fact play a larger role in Pelikan.

Performance of cache services is super important. As I mentioned earlier, we expect a single cache instance to support around 100,000 requests per second on a single core, with latencies below one millisecond. To ensure this, we typically apply a synthetic workload with a fixed request rate and record the latency of each request into a histogram. Our goal is to hit the highest throughput with the p999 latency below some threshold. The p999 is the 99.9th percentile of the distribution of all the request latencies.

So why do we care about this p999 number, the 99.9th percentile? If we only need to make a single cache request for some higher-level request, then only 0.1 percent of those requests will see a latency above the p999. But this is where fan-out comes in. If we need to make multiple cache requests to serve some higher-level request (say, loading a user profile takes ten cache requests), then we multiply our odds of seeing a high latency. With a fan-out of one, only 0.1 percent of requests see a latency greater than the p999; with a fan-out of ten, about one percent do. For even larger operations, like loading a home timeline, the fan-out might be on the order of a hundred, which means roughly ten percent of requests see the p999 latency. The key takeaway is that making multiple cache requests per higher-level request causes us to see the tail latency far more often than you might think, and that makes us care about percentiles that might otherwise seem too far out in the tail to matter. With fan-out, they do.
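To make that fan-out arithmetic concrete, here is a tiny illustrative snippet (not from the talk) that computes the fraction of higher-level requests expected to exceed the p999 latency, assuming independent per-request latencies:

```rust
/// Probability that at least one of `fanout` independent cache requests
/// exceeds the p999 latency, where each request has a 0.1% chance.
fn slow_fraction(fanout: u32) -> f64 {
    1.0 - 0.999f64.powi(fanout as i32)
}

fn main() {
    for fanout in [1, 10, 100] {
        // fanout 1 -> ~0.1%, fanout 10 -> ~1.0%, fanout 100 -> ~9.5%
        println!(
            "fanout {:>3}: {:.1}% of requests see > p999 latency",
            fanout,
            slow_fraction(fanout) * 100.0
        );
    }
}
```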
So, we had this implementation of Pelikan Twemcache (Twitter's memcached fork) written in Rust from 2019. It's memcache-protocol compatible, it uses Tokio and async for its networking, and it uses the C storage library to provide the caching. The question was: could we use it as a starting point to add TLS?

First we needed to do performance testing, and unfortunately we found some performance problems with this implementation. The throughput was about 10 to 15 percent lower, which means we'd need 10 to 15 percent more instances to meet the throughput requirements for a cache cluster, making it 10 to 15 percent more expensive to run. On top of that, the latency was 25 to 30 percent higher at the p999, which means we could potentially see timeouts and errors in production.

To visualize the benchmarking results, we looked at a latency curve: p999 latency on the vertical axis versus throughput on the horizontal axis, with the C baseline implementation in blue and the Rust implementation in red. What we could see from this diagram is that the Rust implementation had higher latency throughout the whole range of throughputs, and it didn't reach the same maximum throughput as the C version. So the Rust implementation was slower in terms of both throughput and latency, and that makes it a pretty tough sell for production. But Rust should be as fast as C for this, so let's see if we can make it faster by taking a few steps back.

A basic server that you'll find in a lot of examples is a ping server. It's a really simple server that just responds with the text "PONG" every time you send it a "PING". It has all the basics you need for a network service, though: it accepts new connections, parses requests (in this case, either a ping or a quit), and then either responds with a pong or closes the connection. A super simple protocol. We start with a ping server because it lets us prove out the performance of a design: we have all the same functional blocks as a real service, we're just not storing any data. We can add everything we need in terms of connection and request handling, debuggability, and metrics. It's also a great place to start thinking about adding the TLS support I was supposed to be working on, to prove that the design could easily accommodate both plaintext and encrypted sessions.

Since I was concerned about matching the performance of the C implementation, I wanted to make things as equivalent as possible. The C implementation was basically just using epoll on Linux, so we could use mio as an equivalent and not need to worry about async executors at all. We'd also have a simple event loop, just like the C version.
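Here is a minimal sketch of what such a mio-based ping server event loop might look like. This is illustrative only, not Pelikan's actual code: a real server buffers partial requests across reads, keeps reading until WouldBlock, and handles short writes. The address, token scheme, and buffer size here are arbitrary.

```rust
// Cargo.toml (assumed): mio = { version = "0.8", features = ["os-poll", "net"] }
use std::collections::HashMap;
use std::io::{ErrorKind, Read, Write};

use mio::net::{TcpListener, TcpStream};
use mio::{Events, Interest, Poll, Token};

const LISTENER: Token = Token(0);

fn main() -> std::io::Result<()> {
    let mut poll = Poll::new()?;
    let mut events = Events::with_capacity(1024);

    let mut listener = TcpListener::bind("127.0.0.1:12321".parse().unwrap())?;
    poll.registry()
        .register(&mut listener, LISTENER, Interest::READABLE)?;

    let mut sessions: HashMap<Token, TcpStream> = HashMap::new();
    let mut next_token = 1;

    loop {
        poll.poll(&mut events, None)?;
        for event in events.iter() {
            match event.token() {
                LISTENER => {
                    // Accept every connection that is ready; accept() returns
                    // WouldBlock once the backlog is drained.
                    while let Ok((mut stream, _addr)) = listener.accept() {
                        let token = Token(next_token);
                        next_token += 1;
                        poll.registry()
                            .register(&mut stream, token, Interest::READABLE)?;
                        sessions.insert(token, stream);
                    }
                }
                token => {
                    // A session is readable: read one request, then respond
                    // with PONG or close the connection.
                    let mut close = false;
                    if let Some(stream) = sessions.get_mut(&token) {
                        let mut buf = [0u8; 64];
                        match stream.read(&mut buf) {
                            Ok(n) if n > 0 && buf[..n].starts_with(b"PING") => {
                                // Ignoring short writes for brevity.
                                let _ = stream.write_all(b"PONG\r\n");
                            }
                            // Spurious wakeup: nothing to read yet.
                            Err(e) if e.kind() == ErrorKind::WouldBlock => {}
                            // QUIT, EOF, or a hard error: drop the session.
                            _ => close = true,
                        }
                    }
                    if close {
                        sessions.remove(&token);
                    }
                }
            }
        }
    }
}
```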
To get to a prototype quickly, I used a bunch of foundational libraries that I had been using in other projects. That meant I could focus on getting something built out quickly with a pure-Rust build; if I ran into problems, I could always switch to the C implementations later. I also added TLS support using BoringSSL at this point, so I was sure my design would handle both plaintext and encrypted connections.

And then we benchmark again. Just like the previous chart, we're looking at request latency versus request rate, specifically the p999 latency, and we can see that the two implementations are pretty close. This looked good enough to submit a pull request and merge into the Pelikan repo on GitHub.

Feeling good about that, it was time to move on to what I was actually supposed to be doing: the real goal was to add TLS support to our memcached fork, Twemcache. The primary difference between Twemcache and the ping server is a much more complicated protocol. The memcache protocol has a whole bunch of different requests with variable numbers of fields, and my main focus was on getting all of that protocol handling correct and making sure it performed well. To avoid having to think about the storage side yet, I made a thin wrapper around the standard library's HashMap and used it as a stand-in for the storage library. That let me focus on the protocol and also start forming a storage abstraction layer.
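A rough sketch of what that stand-in and the abstraction behind it might look like (hypothetical trait and type names, not Pelikan's actual API): the protocol layer only sees the trait, so the HashMap can later be swapped for a real storage engine.

```rust
use std::collections::HashMap;

/// Hypothetical storage abstraction: just enough for the protocol layer
/// to execute get/set/delete without knowing the engine behind it.
trait Storage {
    fn get(&self, key: &[u8]) -> Option<&[u8]>;
    fn set(&mut self, key: Vec<u8>, value: Vec<u8>);
    fn delete(&mut self, key: &[u8]) -> bool;
}

/// Thin wrapper around the standard library's HashMap, used as a
/// stand-in while the protocol handling is developed and benchmarked.
#[derive(Default)]
struct HashMapStorage {
    data: HashMap<Vec<u8>, Vec<u8>>,
}

impl Storage for HashMapStorage {
    fn get(&self, key: &[u8]) -> Option<&[u8]> {
        self.data.get(key).map(|v| v.as_slice())
    }
    fn set(&mut self, key: Vec<u8>, value: Vec<u8>) {
        self.data.insert(key, value);
    }
    fn delete(&mut self, key: &[u8]) -> bool {
        self.data.remove(key).is_some()
    }
}
```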
Once I had this prototype in place, I did some benchmarking to make sure I was still on the right track in terms of performance while responding with actual payloads. A pong is only a handful of bytes, but with a cache we might respond with much larger items, so I wanted to make sure the buffering, parsing, and response composition were all really solid. Those benchmarks looked promising, so at this point we could move on to adding an FFI wrapper around the C storage library and integrating it into the new implementation.

But of course, I ran into some issues. Choosing Rust versions of some of the foundational libraries left me stuck when it came to the storage library, which had metrics and logging deeply integrated into it. And we wanted that: we need really good insight into performance and the ability to debug things. So I had a choice. I could rip out the Rust versions of those libraries and improve the ergonomics of the existing wrappers for the C implementations, or I could figure out some way to make the C storage library talk to the Rust implementations of metrics and logging. It got super gross, super fast, and I probably spent way too much time feeling absolutely crushed by this problem, trying to figure out a way forward. Ideas that had seemed bad were starting to sound better: maybe I could rewrite the storage library in Rust.

There were a lot of reasons why I didn't want to do that. The storage library is essentially the place where we run into all sorts of things that aren't super easy to do in Rust: self-referential data structures, managing memory allocations to avoid external fragmentation, many, many linked lists, pointers all over the place, unsafe code everywhere. It just didn't seem like the right thing to do. On the other hand, maybe we could simplify things a little. If we started with something single-threaded, we wouldn't need to think about concurrent access, which might make it easier. Maybe we could implement just the hash table in Rust and use C code to allocate and manage the big blocks of memory. All of that wound up being depressingly difficult as well.

Luckily, around this time there was a newer research storage design by a PhD student who was working with us, and that design was a lot easier to port to Rust. I even simplified it further and took it back to being single-threaded, so there was no locking or concurrency for me to get wrong. Rust normally makes those things pretty easy, but for memory-efficiency reasons I wouldn't have been able to use the Rust primitives for locking and concurrency: they used too much memory for this particular use case. Best of all, once I went down this path, I found I could manage all the memory from Rust without anything terribly exotic like a custom allocator. It was actually pretty straightforward once I started from a different style of storage library.

There were downsides to doing a rewrite, and they would apply whether I rewrote it in Rust, C, or any other language. It cost me two months of work that essentially duplicated previous efforts, and even at the end of that I hadn't hit exact feature parity; there's still more work to be done. In terms of benefits, though, the rewrite pushed us forward onto the newer storage design, which has benefits for a lot of our workloads at Twitter. Otherwise we would still be on the older storage library, not yet tapping into any of the new design's advantages. Additionally, during the rewrite I identified some areas where we could improve the new design even further. Those improvements help some percentage of our workloads, not all of them, but enough that this also seemed like a win. And lastly, the rewrite made the design more production-ready. It was pretty much a research reference implementation, and having another software engineer really dig into the code helped address things that might have caused us pain a little later. It's always good to have a second set of eyes deep in the code, looking for tweaks that improve things further or areas to clean up.

There were also benefits that came purely from rewriting this in Rust rather than any other language. I'd seen this quote before on the rust-lang page, but it rang extra true after working on this project: Rust does empower everyone to build reliable and efficient software. The code you produce winds up being high-performance, and you're able to code with confidence in its reliability. The language has all these awesome features and tools around it, and those really help you write good, solid code. Additionally, zero-cost abstractions let you express things in ways that you might not be able to in other languages, which is really cool.

At the end of the project, my manager Yao, the primary maintainer of Pelikan, had this to say about the rewrite and the benefits of Rust: "The Rust implementation finally made Pelikan modules what they ought to be, but couldn't before, due to limitations of the C language. This feels exactly right, and is just as fast." I think that's really cool coming from someone who is really good at C and has been working in this field for a long time. It's encouraging to hear that Rust can express some of these things in a way that's a little more elegant and fits the particulars of the software better. This made me feel good about what I did.

As I mentioned before, Rust has awesome tools, and some of them really helped me make sure my code was performant and solid. With cargo bench and Criterion, I micro-benchmarked critical components like parsing, the storage library itself, and the hash table. Having micro-benchmarks for those really low-level pieces helped me make sure each one was very efficient, and that as I kept making changes I wasn't going to introduce a performance regression.
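For a flavor of what that looks like, here is a hypothetical Criterion micro-benchmark for a request parser. The `parse` function is a stand-in, not Pelikan's actual parser, and the crate layout is assumed:

```rust
// benches/parse.rs
// Cargo.toml (assumed): criterion = "0.5", plus a [[bench]] entry
// for "parse" with harness = false.
use criterion::{black_box, criterion_group, criterion_main, Criterion};

// Stand-in for the protocol crate's parser: recognize "get <key>\r\n"
// and return the key.
fn parse(buf: &[u8]) -> Option<&[u8]> {
    buf.strip_prefix(b"get ")?.strip_suffix(b"\r\n")
}

fn parse_benchmark(c: &mut Criterion) {
    c.bench_function("parse get", |b| {
        // black_box keeps the compiler from constant-folding the input away.
        b.iter(|| parse(black_box(b"get some_key\r\n")))
    });
}

criterion_group!(benches, parse_benchmark);
criterion_main!(benches);
```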
I also used cargo fuzz a lot for the protocol library. Through fuzz testing I was able to find edge cases I hadn't considered in terms of malformed requests. That's one of the really great things about fuzzing: you don't need to think of all the edge cases yourself, you can have the computer discover them for you, and when you find something that breaks your code, you add a new unit test for it. This really helped me feel that the protocol was going to be rock solid and that I wasn't going to see weird crashes in production from things like flipped array indexes, or the other issues I ran into while fuzzing. These two tools helped me feel a lot better about the code I was writing.

Next, the benefit of Rust's zero-cost abstractions, traits, and generics. Segcache, the resulting cache server, is the combination of the memcache protocol with the segment-based storage, and we were able to express it in about 200 lines of code, much of which is boilerplate initializing data structures and the runtime. Really, the code for the ping server and for Segcache is very similar; they differ only in which storage engine and which protocol they use. That shows the composability of these pieces and how neatly we can express it with traits and generics, which I thought was pretty neat. When we write a new storage library or additional protocols, they'll be just as easy to compose together.
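In spirit, the composition looks something like the sketch below. The trait and type names are hypothetical, not Pelikan's actual API: a server is generic over its protocol and its storage engine, and everything else (event loop, sessions, metrics, TLS) is shared.

```rust
/// A wire protocol: how to parse requests and compose responses.
trait Protocol {
    type Request;
    type Response;
    fn parse(&self, buf: &[u8]) -> Option<Self::Request>;
    fn compose(&self, response: &Self::Response, out: &mut Vec<u8>);
}

/// A storage engine: how to execute a request against stored data.
trait Storage<P: Protocol> {
    fn execute(&mut self, request: P::Request) -> P::Response;
}

/// The server is just the composition of the two; connection handling,
/// buffering, and metrics would live here and be shared by every build.
struct Server<P: Protocol, S: Storage<P>> {
    protocol: P,
    storage: S,
}

impl<P: Protocol, S: Storage<P>> Server<P, S> {
    fn handle(&mut self, buf: &[u8], out: &mut Vec<u8>) {
        if let Some(request) = self.protocol.parse(buf) {
            let response = self.storage.execute(request);
            self.protocol.compose(&response, out);
        }
    }
}
```

Under a scheme like this, the ping server and Segcache would just be two instantiations, something like `Server<Ping, NoStorage>` and `Server<Memcache, SegStorage>` (names again hypothetical).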
Some more testing to show off the end results: here I'm comparing upstream memcached, in blue, with the Rust implementation of Pelikan Segcache, in red, zoomed in on the end of the graph at the higher throughputs. We can see that upstream memcached actually has higher latency than this Rust implementation. That's super cool, because memcached is really, really fast, so it's neat to see that we can meet or exceed its performance with a Rust implementation.

So, to wrap up with a summary: there's more work to be done, on two parallel tracks. One is the path to production for this Rust implementation. It needs to get to feature-complete; there are still a few odds and ends needed for full feature parity with the C implementation. Once that's done, we'll move on to more testing, deploy a single instance to production as a canary, and then eventually do an actual deployment. We think we should be able to do this by around the end of the year, in terms of having a Rust cache server serving production traffic, and that's exciting.

There's future work to do as well. Earlier, when comparing with upstream memcached, I was showing single-threaded performance, but this implementation also has a multi-threaded configuration, and so does upstream memcached. While this Rust version of Segcache has better performance than our internal memcached fork, especially when TLS is enabled, it loses out to upstream memcached in multi-threaded configurations. I think we can close this gap. The performance is good enough for us to deploy to production, but I still want to win some benchmarks; that's always fun. There's probably additional optimization we can do even for the single-threaded implementation, and areas I want to clean up there. Additionally, this work doesn't get us a Redis replacement yet, and that's still one of the things we want to do. We'll go back, figure out what the storage needs to look like for that, and write a protocol implementation, but we'll be able to reuse all the other components in terms of foundational libraries, and we feel confident that we'll be able to write a Redis replacement for Pelikan in Rust. And then there's all sorts of really cool stuff I want to play with: io_uring, user-space TCP stacks, all sorts of fun things that might get even more performance out of this.

Rewriting has costs and benefits. The extra time could have caused missed deadlines if I had been working toward one; luckily, I wasn't. But I was duplicating work that had already been paid for, and that did have a cost. There were benefits, though: the code was easier to work with once it was all Rust, I didn't have to fight with CMake and Cargo together, and some new ideas got added to the code. So that's probably a win overall. C and Rust are both very fast; profiling and benchmarking helped me get the performance to match the C implementation, but it did take work. The Rust ecosystem has all these awesome tools I mentioned before: Cargo, fuzzers, and benchmarks that helped me feel the code is solid and will perform well, plus tools like rustfmt and Clippy, which helped me make sure the formatting and the code were idiomatic. That went a long way toward making the code easier to work with and for my manager to review. I think Pelikan has a really exciting future with Rust. It seems like we'll be able to migrate the whole code base to Rust, and I think there's lots of fun work to come.

Thanks for listening to my talk, and thanks to all the folks involved in putting together this conference. Feel free to reach out to me with any questions. Bye!
Info
Channel: Rust
Views: 7,927
Keywords: rust-lang, rust, rustlang
Id: m-Qg3OoPIdc
Length: 26min 48sec (1608 seconds)
Published: Wed Sep 15 2021