CppCon 2016: Nicholas Ormrod “The strange details of std::string at Facebook”

Video Statistics and Information

Reddit Comments

What am I not understanding about the amazing bug that Mark Williams found?

So, you have a 128-byte string and you've gotten rid of the null terminator. Did you malloc 128 bytes or 129 bytes for "data"?

If they malloc'd 128 bytes, then what on earth made them think it was ok to write a null terminator to data[128] in the first place? And in that case, it was broken BEFORE they added the if statement, right?

If they malloc'd 129 bytes, then it would seem that their malloc is completely broken if it's a bug to check data[128] and write 0 to it if it's not 0.

👍︎︎ 51 👤︎︎ u/krappie 📅︎︎ Oct 11 2016 🗫︎ replies

I watched this live. I knew of Andrei's thing with \0 before, it's one of those amazing tricks in programming. Great talk. This is the closest we got to having Andrei at CppCon 2016.

👍︎︎ 19 👤︎︎ u/rose_of_sarajevo 📅︎︎ Oct 11 2016 🗫︎ replies

The new GCC strings could also do the '\0' trick if they moved the size to the end, so that the layout is {data, capacity, junk, size}. Then capacity and junk can contain 16 chars, while the '\0' is in the first byte of size. This works on big-endian machines when size is small, but for little-endian machines you need to do something else: either store the size as a big-endian/middle-endian number, or store size-16.

👍︎︎ 8 👤︎︎ u/twanvl 📅︎︎ Oct 11 2016 🗫︎ replies
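A minimal sketch of the layout this comment proposes (names are illustrative, and the layout is the commenter's hypothetical, not GCC's):

    #include <cstdint>

    struct ProposedLayout {
        char*         data;      // would point at the bytes of `capacity` when small
        std::uint64_t capacity;  // 8 in-situ characters...
        std::uint64_t junk;      // ...plus 8 more, for 16 in situ
        std::uint64_t size;      // stored big-endian: for a small size its first
                                 // (most significant) byte is 0, doubling as '\0'
    };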

Anyone have transcript or article link?

👍︎︎ 3 👤︎︎ u/the_gnarts 📅︎︎ Oct 11 2016 🗫︎ replies

What were the gains of lazily writing the null terminator?

👍︎︎ 3 👤︎︎ u/nikbackm 📅︎︎ Oct 11 2016 🗫︎ replies

So, how did they fix the issue with c_str()? He says it's too slow to write the null terminator with every call, and it's undefined behavior to check if it's there before it's been written, so what did they do?

👍︎︎ 2 👤︎︎ u/Slime0 📅︎︎ Oct 11 2016 🗫︎ replies
Captions
Hello everyone, and welcome to The Strange Details of std::string at Facebook. I'm Nicholas. Now, you might ask: what could possibly be strange about std::string? If you'd asked me that question four years ago, I wouldn't have had an answer for you. But now, now I have some answers — answers to questions like: how are strings implemented? Why does GCC have a 25-byte null array in most programs? And what goes wrong when you try to make strings faster?

The thing is, though, that these questions are not why I am here today. Sure, these questions are what I'm here to answer, but they're not why this talk exists. This talk exists because I am missing the answer to some questions — specifically, the question I most want answered: which string is the most efficient string at Facebook? Which string is the most efficient string for you? I don't have that answer.

I do know a couple of things, though, and the first, most important thing is that strings are important. If you look at Facebook code, it uses strings left, right, and center. I went on to GitHub, cloned a whole bunch of trending C++ projects, and they all use strings left, right, and center too. Hello World is a strange program, because Hello World includes iostream but does not include string — because string is a very large part of the standard library; it's a very core abstraction that everybody uses. Indeed, if you look at the CPU cycles spent inside Facebook's infrastructure, of the time spent in namespace std, 18 percent of it is spent inside string. Inside string we have an opportunity for impact.

But before we can figure out how to make strings better, we first need to understand how they work. So isn't this how you implement a string? Isn't a string just a {size, capacity, char*} triple? Who thinks this is a good string representation? No one? Okay, a few people. See, I used to think this is how strings worked. Four years ago, before I knew anything about strings, I figured strings were internally size, capacity, data. Makes a lot of sense. I thought this was true, and I was happy — and then I took the size of a std::string, and the size of a std::string in GCC 4 is 8. A std::string in GCC 4 just barely fits a pointer.

Okay, so what's the representation, then? How on earth does a GCC string keep track of its data, its size, how much of the space it has allocated on the heap belongs to it? The answer is that in GCC a string is indeed a pointer to the heap, and the heap data is prefixed with the size and the capacity.
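A minimal sketch of that pre-C++11 layout, as described above; the field and type names are illustrative, not libstdc++'s, and the real heap header also holds the reference count discussed below:

    #include <cstddef>

    struct HeapHeader {
        std::size_t size;      // string length
        std::size_t capacity;  // allocated character capacity
        // the character data follows immediately after this header on the heap
    };

    struct CowString {
        char* p;  // the only member, so sizeof(CowString) == 8

        const HeapHeader* header() const {
            // the size/capacity prefix sits just before the character data
            return reinterpret_cast<const HeapHeader*>(p) - 1;
        }
        std::size_t size() const { return header()->size; }  // every query dereferences the heap
    };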
All right, is this all there is to std::string? It turns out GCC has done two pretty nifty optimizations. The first optimization stems from the observation that there is one string that is much more common and popular than every other string, and that particularly common string is the empty string. Every time you default-construct a string, every time you move from a string, you have an empty string. The thing about the empty string, though, is that it's not actually empty: its heap size is one byte. It must have a null terminator, because C++ strings, like their C predecessors, must be null-terminated. So here's a question: do you really want to be mallocing, with 20 nanoseconds of overhead, one byte on the heap every time you default-construct a string? No, and GCC doesn't want to do this either. So they have an optimization: they have a global variable, the empty-string representation, and every single empty string in your code points to this 25-byte array of zeros.

A quick question that comes to mind: why have a global, as opposed to having null be the representation? Why not have null represent the empty string? Because if you have null represent the empty string, then all of a sudden all of your string code has a branch in it. When you take string.size(), you have to check "hey, am I a null string?" — and if I am, the size is zero; otherwise, if I'm not a null string, then please actually go to the heap and find the size. GCC does not want conditionals in its string code, which is why there is a particularly hot 25-byte array in your program. One other advantage of those 25 bytes: it is very good for your memory fragmentation, because it collects all of the empty strings together in one memory location.

The second cool optimization that GCC strings do is an old trick called copy-on-write semantics. Copy-on-write semantics were outlawed in C++11 for concurrency reasons, but the basic premise of copy-on-write is: when you call the copy constructor of a string, just take a pointer to the original string's data, and keep a ref count — which we see here preceding the data — that indicates, like a shared pointer, how many people are referencing the data. Only when you want to modify the string do you actually perform the memcpy. A cool detail about the ref count in GCC: they don't store the actual ref count, they store the ref count minus one, which means the default state — having one owner — is a ref count value of zero. That allows a very nifty trick: when your program is loaded into zero-initialized memory at startup, the empty string does not require any additional processing before it's in a valid state.

So this is the GCC string prior to version 5, and six years ago this was the version of string that Facebook was using: GCC 4.6 with glibc 2.13. This was the string to beat. And there was one guy who said, "hey, I can make a better string implementation," and that man was Andrei Alexandrescu. He's pretty cool. He observed that lots of Facebook programs are bottlenecked on string code — sometimes it's the malloc, sometimes it's hopping around all these different pages to access your string data — strings are slow. And we at Facebook, with GCC strings, were not making a classic optimization: the small string optimization. The premise of the small string optimization is that we want to avoid as many mallocs as possible, and avoid having data in random locations on the heap, by storing string data — if it's small enough — inside the string object itself, on the stack.

You still have to be able to store large strings with a data pointer, so the basic representation of an fbstring is indeed a {data, size, capacity} triple. Note, though, that the size and capacity have been pulled up from the heap and into the struct. That way we have 24 contiguous bytes of space to play with. Then if, as in this case, the string is small, we collapse the string into the struct: we have an alternate representation, a union, in which we store the string data not on the heap but in the structure itself.

So here we have a 24-byte structure, and here's a question: how many bytes of string data can we store in a 24-byte structure? If it had been up to me to implement fbstring six years ago, the answer would have been 22. Here's how I would have done it: 22 bytes of user data, then a null terminator, and one byte left over in which I can store the size of the in-situ string as well as a single flag bit, so that I can identify whether I'm a normal string or a small string. If I had been doing it, it would have been 22 bytes.
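A minimal sketch of the 24-byte footprint just described, with the two representations sharing the same storage; member names are illustrative, not folly's:

    #include <cstddef>

    struct FbStringSketch {
        union {
            struct {                   // medium/large: characters live on the heap
                char*       data;
                std::size_t size;
                std::size_t capacity;
            } heap;
            char in_situ[24];          // small: characters stored directly in the struct
        };
    };
    static_assert(sizeof(FbStringSketch) == 24, "24 contiguous bytes to play with");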
But I wasn't the one doing it — Andrei Alexandrescu was doing it, and Andrei had a different plan. You see, in the small string representation there's a nine at the end. What is that? Because the size of that string is not nine; the size of that string is fourteen. Nine is the amount of spare capacity in the string — the number of extra characters we can push into the in-situ buffer before it becomes full. And the magic of spare capacity is that when the spare capacity is zero — i.e. your string is chock full to capacity — zero is also the null terminator. Your spare capacity, conveniently placed at the end of the string, does double duty as a null terminator, allowing you to fit 23 bytes of in-situ capacity. As you push characters back into the string, the null terminator converges to the right and the spare capacity drops toward zero, both of them meeting at the 24th byte — storing 23 bytes in situ. Andrei is very clever.

So I said there's a flag bit somewhere: we have to be able to differentiate between a small string and a normal string, so where does that flag bit go? If you look at that small string, there are 23 bytes of user-defined data that we really can't touch, and then we've got a zero byte. Where is the flag bit going to go? Well, conveniently, flag bits can be zero. So no matter what, we're going to have to put the flag bits in the 24th byte, and a flag byte of zero must indicate that you have a small string. Now, the spare capacity, because it ranges from 0 to 23, takes 5 bits, which means we have 3 bits left over to be our flags — we've got some space inside the small string for some flag variables.

If we look at the normal string, we also don't immediately seem to have a lot of space to play with for flag bits. For those of you who were at Chandler's talk this morning: Chandler likes to say, "hey, I'm going to use the low bits of the data pointer, because malloc is always going to return pointers that are 16-byte aligned." But Andrei had a different plan, and it goes through jemalloc, which is Facebook's malloc implementation. jemalloc, like most malloc implementations, works with size buckets, which means that if you ask for 29 bytes, for example, jemalloc will internally round that up to a convenient size — in this case 32 — and then go off to a bucket full of 32-byte memory chunks and return one as the result of malloc(29). fbstring knows about this trick. fbstring knows that jemalloc will secretly return allocations that might have some extra space left over, and so Andrei put in an extra optimization: a check to see whether we could quietly increase the capacity beyond what we requested from malloc, without actually requesting different memory. This way fbstring has a little bit of extra capacity to play with and doesn't waste memory on the heap. Oh, and by the way, as a side effect, all of jemalloc's buckets are multiples of 8, so you are guaranteed to have three spare bits at the bottom of capacity — and that is where fbstring stores its flag bits in the normal string representation.
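A minimal sketch of the spare-capacity trick for the in-situ case, as described above (illustrative code, not folly's): the last byte stores how many characters still fit, so when it reaches zero it is itself the null terminator of a full 23-character string, and its top three bits are the (all-zero) flag bits of a small string:

    #include <cstddef>
    #include <cstdint>

    constexpr std::size_t kInSituCapacity = 23;

    void small_push_back(char (&buf)[24], char c) {
        std::uint8_t spare = static_cast<std::uint8_t>(buf[23]);  // spare capacity / flag byte
        std::size_t  size  = kInSituCapacity - spare;
        // assume the caller has already checked spare > 0, i.e. the character fits in situ
        buf[size]     = c;
        buf[size + 1] = '\0';                          // keep the string terminated as it grows
        buf[23]       = static_cast<char>(spare - 1);  // when this reaches 0, it *is* the '\0'
    }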
There is a third mode, by the way, to fbstring, which is copy-on-write semantics for strings greater than or equal to 255 bytes long. There is a — shall we say — large string optimization: we stick a ref count on the data on the heap, and we use one of the extra flag bits inside capacity to say, "hey, we've got a large string." If you want to learn more about large strings, fbstring, and jemalloc, Andrei Alexandrescu did a talk covering all the stuff I cover in this slide and a hell of a lot more. His talk is available; it's called "Sheer Folly" — Folly being Facebook's open-source library, in which fbstring is published, and on which Dave Watson will be giving a talk tomorrow. Anyway, if you want to learn more about small strings and fbstring: Andrei Alexandrescu, "Sheer Folly", there's a talk.

One more thing I'll mention: clang has a very, very similar implementation of string. They also have a 24-byte structure; they store 22 bytes of in-situ capacity, but they don't rely on the fact that malloc will always return pointers that have some zero bits at the bottom, and specifically they place the flag bit in a different location. The trick they use is that most standard containers — in fact, all standard containers — have a max_size() function; standard containers can specify, "hey, you're not allowed to grow beyond a given size." And clang has reasoned that no one needs a string of size 2^64, so they have arbitrarily decided that strings may not exceed size 2^63, leaving them one spare bit in both capacity and size. That's how clang does things.

Okay, back to fbstring. We have this cool string — Andrei Alexandrescu made it six years ago, in 2010. How does it compare, how does it stack up against GCC's implementation of string? And the answer, if you look at the assembly code, is: pretty bad. This does not look promising. Here is the assembly code for string::size() for both GCC strings and fbstring. GCC's string::size() is just two instructions; fbstring's has lots of instructions, and it starts with a conditional. Almost every single fbstring function first has to check which representation it is: are you a small string? Because if so, I have to work one way, and if you're not a small string, then I have to interpret the data in a totally different fashion. And this branching in your assembly code is the exact thing that GCC tried not to do. There we go — that is the code, right there, and it occurs in every single string function.

So what do the benchmarks have to say? The benchmarks are confusing. When I first ran these numbers I just stared at them, and I had no clue what was going on. How on earth can those nine lines of assembly code, including a branch, be faster than GCC's quick two assembly instructions? The answer all boils down to the memory layout of your program. You see, GCC stores its data always on the heap — and its size is also on the heap — so every time you call string.size(), every time you call string.data(), you have to go to a likely different page in memory and load a new page, and that's the real thing that's slowing strings down. Two quick tests confirm this. I ran this benchmark, by the way, over a hundred thousand strings, where you do actually see the effect of page misses. If I change the benchmark instead to run over 64 strings in quick succession, then GCC's benchmark — because all of the data fits in my working set — drops from 1.6 nanoseconds down to 0.3 nanoseconds; it's much faster. The other quick test: if I run perf over the hundred thousand strings, I see that GCC has L1 cache misses three times higher than fbstring's. So we're in a bit of a conundrum right now. We're looking at fbstring's benchmarks, and it's like, hey, the assembly code is clearly worse, but at the same time the memory layout is better. What wins?
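A minimal sketch of the branch that opens almost every fbstring-style accessor — the thing the two-instruction GCC size() doesn't have to do. The layout and the flag mask are illustrative stand-ins, not the real library code:

    #include <cstddef>
    #include <cstdint>

    struct SsoString {
        union {
            struct { char* data; std::size_t size; std::size_t capacity; } heap;
            char in_situ[24];
        };
    };

    constexpr std::uint8_t kFlagMask       = 0xE0;  // hypothetical placement of the three flag bits
    constexpr std::size_t  kInSituCapacity = 23;

    std::size_t sso_size(const SsoString& s) {
        const std::uint8_t last = static_cast<std::uint8_t>(s.in_situ[23]);
        if ((last & kFlagMask) == 0)            // flag bits all zero => small representation
            return kInSituCapacity - last;      // the last byte holds the spare in-situ capacity
        return s.heap.size;                     // otherwise the size sits directly in the struct
    }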
And the only way to find out what wins is to replace std::string with fbstring, which we did four years ago, in 2012. One day you included <string> and the size of std::string was 8; the next day you included <string> and the size of string was 24. Then we went to our website team and said, "hey guys, can you run the website, figure out what the numbers are, what the performance implications are with this new implementation?" Our team went and ran fbstring in place of std::string, and the answer we got at the end of the day was a one percent performance win. I don't know if you're laughing because that's a good number or a bad number — it sounds small. Bjarne, yesterday, talked about really wanting his 10x improvements, and spoiler alert: you're not going to get a 10x improvement by replacing string. But this 1% is one percent across all of Facebook. This 1% doesn't just apply to the compiler; this 1% applies to every single C++ program that Facebook runs. This is really, really good for us. I'll let you in on a secret: this is how Facebook works. Facebook has these leaps where we make things five times better, ten times better, but in between leaps we have a whole bunch of little 1% wins that compound over time to make all services more efficient. This is one of them — a really good one — and we have kept it; we have continued to dedicate engineering effort to maintaining fbstring inside our libstdc++.

So we have replaced std::string with fbstring. Now that we own the string, what cool things can we do? I think the coolest thing we tried to do is we tried to kill the null terminator. Who here, if they could snap their fingers — no bugs — would get rid of the null terminator in C++? Would you? There we go, most people. See, I personally despise the null terminator. It's not something C++ programs should rely on: C++ standard strings are allowed to contain intermediate null characters, so if you are relying on searching for a null terminator to figure out how big your string is, your code has a bug in it. So fbstring had this cool optimization where we'd lazily write the null terminator. When people push back elements into a string, or append a bunch of strings together, we'd lazily write that null terminator, because people should not be depending on it. We even had a mode where we would explicitly write a bogus null terminator, to make sure that people's code was actually crashing instead of possibly silently succeeding. Of course, c_str() and data() did have to append a null terminator, because those calls interoperate with libraries that expect C strings.

Technically, by the way, this is illegal — this is not something that is actually allowed by the standard. Much like copy-on-write, which was disallowed in C++11, this lazy writing of the null terminator is not something you really want to do from a concurrency perspective. But we did it anyway, and it works, and the reason it works in practice is that most people write good code — most people write code that doesn't rely on the null terminator — and in the odd cases where they do, we own the entire stack, so we can go in and fix the code.
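A minimal sketch of the lazy-terminator idea, under the assumptions described above (an illustrative class, not folly's code; and, as noted, writing through a const accessor is exactly the non-conforming part):

    #include <cstddef>

    class LazyString {
        char*       data_;
        std::size_t size_ = 0;
        std::size_t capacity_;
    public:
        explicit LazyString(std::size_t cap) : data_(new char[cap + 1]), capacity_(cap) {}
        ~LazyString() { delete[] data_; }

        void push_back(char c) {
            // assume size_ < capacity_ for brevity; note that no '\0' is written here
            data_[size_++] = c;
        }
        const char* c_str() const {
            data_[size_] = '\0';   // the terminator is written only when a C string is requested
            return data_;
        }
    };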
So I'm kind of on the fence as to whether I like the fact that we still have copy-on-write strings — we do still have copy-on-write strings, and they haven't broken irreparably for us — but we don't have the dead null terminator anymore. We lost that fight. And you know what killed us? const. You see, c_str() is a const function, and the standard specifies that c_str() shall not modify the data in any way — with good reason.

Our search team came to us one day and said, "hey, build team, our code is slow, and we found out why." You see, we have a global read-only string variable, and we've got lots of individual threads reading from this global string and calling c_str() on it. From the programmer's perspective, c_str() is definitely const: they just call c_str() and they get their null-terminated string. But from the CPU's perspective, this function is definitely not const: every time any thread calls c_str(), a null terminator is appended to the back of the string, which trashes the cache line on all the cores. This was terrible — absolutely terrible — for the performance of our search code.

Fortunately, we've got some smart people on the search team, and they said: "hey, you know what we're going to do? We're going to see if someone's beaten us to the punch. We're going to check whether some other thread has already called c_str() — see if the null terminator exists — and only if it doesn't exist do we have to write it." I liked this diff, by the way. I approved this diff — I waited for the tests to pass first, but then I approved it — and I was happy. All was good in the world, because we no longer had the problem of every call writing the null terminator.
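A minimal sketch of that change, as described (an illustrative free function, not the actual diff); the read on the first line is the piece that turns out to be undefined behavior:

    #include <cstddef>

    const char* lazy_c_str(char* data, std::size_t size) {
        if (data[size] != '\0')   // may read an uninitialized byte just past the string
            data[size] = '\0';    // write (and dirty the cache line) only when necessary
        return data;
    }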
All was good, that is, until Mark Williams came and commented on the diff: "This diff is broken." Mark Williams, by the way, is one of these characters who will just occasionally swoop in and fix some ludicrously difficult problem that you didn't even know you had. This diff has undefined behavior; this diff potentially reads from uninitialized memory.

Here's the setup. Mark Williams, on his build server, runs a compiler that maintains a list of file names taken from the operating system. If one of those file names happens to be 128 bytes long, the string containing it will need to allocate at least 129 bytes of memory — 128 plus one for the null terminator. jemalloc rounds 129 up to 192. What's the magic of 192? 128 is a multiple of the GCD of 192 and 4096, which is the page size. What does that mean? That little bit of gobbledygook math means that if you want a string that is 128 bytes long, malloc might return you an address that is precisely 128 bytes away from the end of a page. So if you happen to allocate a string that's 128 bytes, and malloc happens to return a memory address that's 128 bytes away from the end of a page, you will write the 128 bytes, and the null terminator will be the first byte on the next page.

Now, before you call c_str(), what you really need to have happen is for malloc to allocate memory on that second page, free it, and then conditionally return it to the kernel. Now call c_str(). If you call c_str() while the page is conditionally returned to the kernel, the kernel says: "hey, you're reading from uninitialized memory, I can do whatever I want — I'm going to return zero." Zero is the null terminator, so no terminator is actually written when you call c_str(). However, if you next perform a real write to that second page, the kernel realizes you want the actual page back, and it returns you the original page with the original data — which may not contain a null terminator as its first byte. If you then do not call c_str() again, you will have a non-null-terminated string floating around in your program, causing it to crash.

If that was hard to follow: press pause, rewind the YouTube video that you are currently watching, press play, and watch it all again. And when it still makes absolutely no sense, just stand in awe at the fact that there were five stars that had to align to cause this bug — and Mark Williams found it. I have absolutely no clue how that man found that bug. If this bug had been dropped at my doorstep, I would have just stared at it and been hopelessly, hopelessly lost. This bug was about two years ago. It's the most complicated bug I've ever seen inside any piece of code, really, but certainly inside fbstring.

So that was two years ago. What about today? What has changed between four years ago and today? C++14 — but that's actually not the thing that caused me to give this talk. The thing that caused me to give this talk in the first place, the thing that kicked off all of these changes, is that GCC has a new string representation. GCC 5 has a new string layout, and I don't know whether the new string layout is better than fbstring anymore. We've had this 1% performance optimization for four years, and it needs to be re-evaluated.

So how do the new GCC strings work? Well, the first thing is that they're C++11-compliant: they've dropped the copy-on-write trick, and they have adopted the small string optimization — so of course they now have a larger struct. They've got data, size, capacity, and they also have 8 bytes of unused space at the end of the struct. If there is a small string, they multi-purpose the capacity field as well as the 8 unused bytes at the end to store the data in situ.

How does this compare to fbstring? The first thing is that these GCC strings have the small string optimization: the big win we got from fbstring four years ago was the small string optimization, and GCC now has that. Another interesting thing is that data and size both have the same representation in both formats of the GCC string, which means there's no conditional branching going on inside data() or size() calls, which means the assembly code is much nicer looking and much faster. Also, the size of std::string is 32, which quite conveniently fits two of them into one 64-byte cache line.

This isn't to say, though, that the new GCC strings are strictly better than fbstring. The big downside is that there are only 15 bytes of in-situ capacity, and it happens that one of Facebook's favorite things to put in strings is base-10-serialized 64-bit random numbers and 64-bit UIDs — and a 64-bit UID serialized in base 10 takes up to 20 bytes, which in an fbstring is stored in place but with these new GCC strings gets pushed off to the heap. Another thing — and this is a smaller thing — is that a move is no longer a memcpy. If you look at vector instantiations, the most common vectors are vectors of ints, vectors of strings, and vectors of other vectors, and it happens that for these very common vector types, when you do a resize, instead of doing a move and a destroy you can just do a bit copy: don't bother with the move constructor, don't bother calling destructors on the original data, just do bit copies. That optimization is broken with the new strings. Lastly, the size is 8 bytes larger than an fbstring, and this 33% increase in the stack size of a string does affect the memory layout of your program.

So, a couple of pieces of data. I'm not really worried about the move — the move thing isn't great, but move is called one or two orders of magnitude less frequently than string::size() is called, so the wins in size() should definitely compensate for the losses in move.
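A minimal sketch of the new layout as described here; field names are illustrative, not libstdc++'s. Because data_ always points at valid storage — either the in-situ buffer or the heap — data() and size() need no branch:

    #include <cstddef>

    struct Gcc5StringSketch {
        char*       data_;          // points at buf_ when small, at the heap otherwise
        std::size_t size_;
        union {
            std::size_t capacity_;  // used when the characters live on the heap
            char        buf_[16];   // in-situ storage: 15 characters plus '\0'
        };
    };
    static_assert(sizeof(Gcc5StringSketch) == 32, "two strings per 64-byte cache line");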
The big thing that I'm worried about is the reduction in in-situ capacity: we're losing 8 bytes of in-situ capacity, and I don't know what the performance implication of that will be inside Facebook, because strings with sizes in the range 16 to 23 are a double-digit percentage of all strings. So we're going to run an experiment — we're currently in the middle of running an experiment — to figure out whether we will move to the new GCC strings, because it is indeed a fair bit of engineering effort to maintain our own custom implementation of string.

So those are strings — those are a few implementations of string that GCC, clang, and Facebook use. By the way, Microsoft's version is very similar to GCC's new strings. The thing about strings, though, is that it's not a solved problem. Four years ago I thought, hey, strings are a size, a capacity, and a char* pointer, and that's it — but that's not the case. Strings are still changing, as evidenced by the fact that GCC just got a shiny new SSO-optimized string implementation. And the thing about these string implementations is that, yes, it's important to have good, optimized assembly code, but it is also very important to understand how the memory layout of your code is affected by the data layout of a string. The only real way to test this — the only way for me to know which string is the best string for Facebook, and the only way for you to know which string is best for you — is to run some tests on some real-world data.

My name is Nicholas. It's been a pleasure giving this talk to you today, and I'll be around for questions. Thanks.
Info
Channel: CppCon
Views: 80,007
Rating: 4.9151397 out of 5
Keywords: Nicholas Ormrod, CppCon 2016, Computer Science (Field), C++ (Programming Language), Bash Films, conference video recording services, conference recording services, nationwide conference recording services, conference videography services, conference video recording, conference filming services, conference services, conference recording, conference live streaming, event videographers, capture presentation slides, record presentation slides, event video recording
Id: kPR8h4-qZdk
Channel Id: undefined
Length: 31min 18sec (1878 seconds)
Published: Wed Oct 05 2016