CppCon 2018: Bob Steagall “Fast Conversion From UTF-8 with C++, DFAs, and SSE Intrinsics”

Reddit Comments

Honestly, this makes me want to bench test StringIO.

it's not optimized at all, and I'm sure I'd lose to most of those algorithms, but I'm still curious.

Also, what's going on with LLVM's Unicode library?

It looks like LLVM's converter is in LLVMRepo/lib/support/ConvertUTF.cpp

Edit: LLVM's converter is very focused on UTF16, and idk there's a lot of tables and stuff, I wouldn't have written it that way...

2 points · u/bumblebritches57 · Oct 13 2018

I feel this is a super convoluted way to implement a state machine.

I feel that implementing it with a clean set of branches rather than a table lookup would not only be faster on the whole, but also give cleaner code that would not require assembler whatsoever.

2 points · u/tjgrant · Oct 14 2018

As far as error handling is concerned, invalid code units cannot simply be dropped. They must be substituted with U+FFFD or the conversion must halt. The security implications are covered here: http://unicode.org/reports/tr36/#Ill-Formed_Subsequences

Thank you for the great presentation!

1 point · u/Dalzhim · Oct 17 2018
Captions
- Good morning everybody. My name's Bob Steagall, and I'm here to talk about a little bit of code that I've been working on over the last year or so, for performing fast conversion from UTF-8. It uses C++. It uses a little DFA that I rolled together, and it uses SSE intrinsics for some speed in certain cases. I started working on this problem because I was under the delusion that I wanted to write a JSON parser, and part of writing a compliant JSON parser is that I have to be able to handle UTF-8. As part of that I discovered that with UTF-8, in order to find code points, you have to convert it to UTF-32, and I didn't really know what either one of those were. So I wrote something that was very simple and somewhat slow, and I looked at the problem and realized, hey, maybe I can make something a little bit faster. This is the story of how I did that. I got some pretty good results and I'm here to share them with you.

So I'd like to begin with a few definitions to frame the discussion, so that you understand the code we're about to see. I'm going to show you probably more code than you usually see in one of these presentations, but I promise you it's all very simple code. I'll talk briefly about what UTF-8 is. We'll do a context switch and talk a little bit about what a DFA is. Then we'll talk about how one could use a DFA to recognize UTF-8. We'll talk about the code that I wrote. I'll show you some performance measurements. And finally, I'd like to say that I am by no stretch of the imagination a Unicode expert, so don't ask me any Unicode questions 'cause I won't be able to answer them. I approached this from the perspective of an algorithm problem: I would like to convert a stream of bytes to a 32-bit unsigned integer according to some rules, and how could I do that as quickly as possible? Anything outside that context I was not really super interested in.

So, some definitions. What's a code unit? A code unit is a single indivisible integer element of an encoded sequence of characters. A sequence of code units specifies a code point. You can think of code units as being atoms, and a code point as being a molecule, and it's the molecule that we're interested in. It's the molecules that represent characters. By themselves, code units really don't have any meaning. They don't identify any particular character. They don't identify any particular code point. The meaning of a particular code unit is derived from the encoding that that code unit is intended to represent. In modern C++ there is a list of types that are commonly used as code units for various encodings and different character representations. As it turns out, you can use uint8_t to represent code units in UTF-8.

So an encoding, which I just mentioned, is a method of representing a sequence of characters as a sequence of code units and code unit subsequences. An encoding can be stateless or stateful. A stateless encoding is one in which the decoding of the next sequence of code units does not rely or depend on any decoding you did previously, and of course stateful is the opposite, where decoding or encoding depends on some operation that you've done previously. Encodings can be fixed width or variable width. UTF-32 is a fixed-width encoding. UTF-8 is a variable-width encoding. An encoding can support bi-directional decoding: you might have sequences that represent right-to-left or left-to-right text in the same sequence of code units.
I'm sure you've all heard of these encodings: UTF-8, -16 and -32, ISO 8859, and of course Windows code page 1252. So a code point, as I mentioned, is the molecule. It's the thing that we're interested in. It is an integer value that denotes an abstract character, which is defined by a character set. A code point by itself doesn't represent any particular character. It's just a numerical value. The meaning that we give to a particular code point is derived from the character set definition. In modern C++ we use char, wchar_t, char16_t and char32_t to represent code points, char32_t being the one that most people use for UTF-32.

So what's a character set? A character set is a mapping of code point values to abstract characters. It doesn't need to provide a mapping for every possible code point value that can be represented by the code point type. For example, all ASCII characters can be represented in fewer than 128 code points. Common character sets: well, we all know what ASCII, Unicode and Windows code page 1252 are. A character, this is the important thing, a character is an element of a written language, the thing that we see, of which there are glyphs or visual representations. For the purposes of this talk, a character, the glyph, the thing that you see, is identified by the combination of a character set and a code point value. So a character set is a mapping in which specific code point values represent specific characters.

All right, so getting into the meat. What is UTF-8? UTF-8 is a variable-length encoding scheme for encoding code points. Each code point is encoded by a sequence of one to four 8-bit unsigned integers, uint8_t or unsigned char. I will refer to these in the rest of the presentation as bytes or octets, but when I say byte or octet, I mean a uint8_t. The first byte in one of these sequences indicates the total length of the sequence, and we'll see how that works. ASCII characters, as you might expect, are encoded in the range zero to hex 7F. The first byte in a multibyte sequence always ranges from hex C2 to hex F4, and we'll see why that is in a moment. Trailing bytes, in other words the bytes that come after the first byte in a multibyte sequence, always lie in the range hex 80 to hex BF.

When I was learning about this and trying to understand how things work, it helped me quite a bit to think of the octets in terms of their individual bits. So let's look at a one-byte sequence, an ASCII byte. In UTF-8 code unit sequences, the length is determined by the first byte. So in this case, the first byte begins with a leading zero, which indicates this is a one-byte sequence. The trailing seven bits, which are represented by the Xs here, represent the information that can be stored in this code unit. In a two-byte sequence, the leading byte has one-one-zero as its upper three bits. So decoders look at those upper three bits; they see that the upper three bits, from high to low, are one-one-zero, and the two ones in the upper two bits indicate this is a two-byte sequence. There is a zero which acts as a separator, and then the lower five bits of the upper byte contain useful information. The trailing byte, the second byte, begins with one-zero. This is why trailing bytes always range from hex 80 to hex BF: because the upper two bits of trailing bytes must be one-zero.
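As a minimal sketch of that bit layout (my own function, not the talk's code), here is how an already-validated two-byte sequence folds into a code point:

#include <cstdint>

// Minimal sketch, not the talk's code: decode an already-validated two-byte
// UTF-8 sequence. The leading byte (110xxxxx) contributes its low five bits,
// the trailing byte (10xxxxxx) its low six bits.
inline char32_t decode_two_byte(std::uint8_t lead, std::uint8_t trail)
{
    return (char32_t(lead & 0x1Fu) << 6) | char32_t(trail & 0x3Fu);
}

// Example from the next slide: 0xC2 0xA9 decodes to U+00A9, the copyright sign.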
Similarly, with a three-byte, or three-octet, sequence, the upper four bits in the leading byte are one-one-one-zero, followed by two trailing bytes, and finally, for a four-byte sequence, the upper five bits are one-one-one-one-zero, followed by three bits of information and three trailing bytes. Now if you look at the total amount of information that can be stored in each of these layouts, a one-byte sequence can store seven bits, a two-byte sequence can store 11 bits, a three-byte sequence can store 16 bits of information, and a four-byte sequence can store 21 bits of information.

So let's look at some valid sequences. Jon Kalb's favorite feature of C++, the right brace. Here's the representation of the right brace, and I've highlighted the bits of information in yellow. Here is the UTF-8 representation of the copyright sign, which is represented by hex 00A9, and for some reason in Unicode, instead of 0x, they write U-plus. But what we're looking at is hex A9. The alternating sequences of yellow, green and yellow represent nibbles in the byte sequence, so you can look directly and see that in green you've got A, and then in the yellow on the right you've got nine. And when encoded in the two-byte sequence, it's C2 and A9. Similarly for a three-byte sequence, which encodes the not-equals sign. So it's pretty easy to understand when you stretch the bits out like this: encoding takes bits and stretches them out, and puts them into UTF-8 sequences, and decoding takes those bits and compresses them back into code points.

But there is an important condition called overlong sequences. Consider the closing brace. This is hex 7D. When represented as a single-byte sequence, it's a valid ASCII leading byte. It's a leading byte in a one-byte sequence, and it's valid. However, if you try to encode that in a two-byte sequence, you would be using less information than a two-byte sequence is capable of representing. In other words, you've got some leading zeros there that you don't need to have. If you represented it in a three-byte sequence, you have even more leading zeros where you should be having bits of information. These are called overlong sequences because you're using more bytes to represent the code point than you actually need to, and this is not legal UTF-8. In UTF-8 you're supposed to use the minimum number of octets required to represent the code point of interest. And I've read that a long time ago there were some security exploits that occurred by feeding overlong sequences into some piece of software.

So, in doing this conversion there are boundary conditions, like in any problem. There is the maximum code point value, which is hex 10FFFF, which represents 17 planes of 64K code points per plane. There are these strange things called UTF-16 surrogates, which range from hex D800 to hex DFFF. They are used to encode code points greater than hex FFFF in UTF-16. I'm not gonna talk about them, other than to say that they should never appear in UTF-32 strings or UTF-8 encodings by themselves. And I just mentioned overlong sequences. Well, there's no such thing as a one-byte overlong sequence, but for two-byte overlong sequences, any sequence beginning with hex C0 or C1 is overlong.
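A hedged sketch of those boundary conditions as code; the names are mine, not the library's:

#include <cstdint>

// Sketch only: a decoded value is a valid Unicode scalar value only if it
// does not exceed U+10FFFF and is not a UTF-16 surrogate (U+D800..U+DFFF).
inline bool is_valid_scalar_value(char32_t cp)
{
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

// Two-byte overlong sequences can be rejected from the leading byte alone:
// 0xC0 and 0xC1 could only encode values that already fit in one byte.
inline bool is_overlong_two_byte_lead(std::uint8_t lead)
{
    return lead == 0xC0 || lead == 0xC1;
}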
Any three-byte sequence that begins with hex E0 followed by a trailing byte whose value is hex 9F or less is overlong, and similarly with four-byte sequences: any sequence that begins with hex F0 and is followed by a trailing byte whose value is hex 8F or less is overlong. If you map out the pattern of bits, you will see that those are overlong sequences.

So here's a sample converter, how you actually do the conversion. This is sort of the canonical, simple way of doing it. At the top of the function you can see I have an input array and I'm taking a pointer. I'm using char32_t as my code point output parameter on the right, and I just have an if-else ladder, where I start at the top and I look at the value of the leading byte, I do some masking to find out what range it's in, and then, depending on how many trailing bytes I expect, I do some bit manipulations in order to compose the bits from the input octets into the code point, and then I have a little function at the end that checks the value to make sure it is a valid code point. This is a very standard way of doing it. There are lots of bits of software that do something very similar. (speaking faintly) You look at the first byte of the UTF-8 sequence and it should tell you how many bytes there are to advance. (speaking faintly) I'm sorry, it's what? (speaking faintly) Oh yeah, yes. The pointer itself is not being advanced. You're right. Yes, I should have passed the pointer in by reference.

All right. So, I'd like to talk about what a DFA is. A DFA is a deterministic finite automaton. It's a finite state machine that accepts or rejects strings of input symbols. DFAs recognize what are called regular languages. They're useful for simple pattern matching. They're defined mathematically by five attributes: a finite set of states, a finite set of input symbols called the alphabet, a transition function which determines transitions between states based on the input symbols, a start state where you begin your matching, and one or more accept states where you conclude your matching. So how does it work? Well, given the current state and a pending input symbol, which is typically called the lookahead, the transition function specifies the next state. You begin at the start state, of which by definition there is one, and you consume symbols and execute state transitions until recognition halts. Well, when do you halt your recognition? You halt recognition when an accept state is reached, or when there is no transition leaving the state, in which case you reject the string because you cannot get to an accept state. DFAs are fairly limited in the languages that they can recognize. They can recognize very simple regular expressions, where you have concatenation, Kleene closure, or alternation. They cannot solve problems that require more than constant space, such as matching properly paired sets of parentheses. For that you also need to have a stack.

This is an example of a very simple DFA. It will match any number of leading spaces, followed by an optional plus or minus sign, followed by a sequence of decimal digits, and here I'm allowing leading zeros in my integer. So you can see that I have a start state, and I have transitions out of the start state based on what my input is.
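For reference, the canonical if-else converter described a moment ago might look roughly like this. It's a sketch under my own names and simplifications, not the slide's code: the pointer is passed by reference (the fix noted above), and continuation-byte and overlong checks are left out for brevity.

#include <cstdint>

// Sketch only: decode one code point from a UTF-8 buffer with the classic
// if/else ladder, keeping the overall shape described on the slide.
inline bool next_code_point(std::uint8_t const*& p, std::uint8_t const* end, char32_t& cp)
{
    std::uint8_t const b0 = *p;

    if (b0 < 0x80)                                  // 0xxxxxxx : one byte
    {
        cp = b0;
        p += 1;
    }
    else if (b0 >= 0xC2 && b0 <= 0xDF)              // 110xxxxx : two bytes
    {
        if (end - p < 2) return false;
        cp = (char32_t(b0 & 0x1Fu) << 6) | (p[1] & 0x3Fu);
        p += 2;
    }
    else if (b0 >= 0xE0 && b0 <= 0xEF)              // 1110xxxx : three bytes
    {
        if (end - p < 3) return false;
        cp = (char32_t(b0 & 0x0Fu) << 12) | (char32_t(p[1] & 0x3Fu) << 6) | (p[2] & 0x3Fu);
        p += 3;
    }
    else if (b0 >= 0xF0 && b0 <= 0xF4)              // 11110xxx : four bytes
    {
        if (end - p < 4) return false;
        cp = (char32_t(b0 & 0x07u) << 18) | (char32_t(p[1] & 0x3Fu) << 12)
           | (char32_t(p[2] & 0x3Fu) << 6) | (p[3] & 0x3Fu);
        p += 4;
    }
    else
    {
        return false;                               // invalid leading byte
    }
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);   // final validity check
}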
If I were going to represent it as a table, the table might look something like this, where I have my states in the rows and my inputs in the columns. For example, if I'm at state zero, which is the begin state, and I get a digit, I can immediately go to state two, which is an accepting state. But I keep trying to accept more digits, and when I finally find something that I can't accept as a digit, I know that I'm done. I've recognized something. I'm in the accept state, so I can accept it. If I receive a sign, a plus or minus, in state zero, I go to state one. From state one, if I get a digit, I go to state two. If I'm in state one and I get anything else except a digit, I reject it, and so on and so forth. A very simple example of a DFA.

So how can we use this to recognize UTF-8? Well, remember we have boundary conditions that we need to keep in mind as we're trying to build the DFA, and here I've just repeated them. When I was working on the problem, I tried to figure out: how can I find the transitions in the DFA that account for the boundary conditions? We're not gonna go over all of these slides because they're very detailed, but they're in the presentation in case someone wants to look at them. What I really did is I started in the left-hand column and worked down, mapping out the binary and hex versions of code points and looking for those places where there were transitions. So for example, if you go from hex 7FF to hex 800, and you map out the UTF-8 hex in the third column there, you can see that there is a certain range of values of leading byte and trailing bytes that are overlong. Same thing in the first row, and the same thing for the surrogates here. So the red lines represent boundary conditions, which mean state transitions have to be created in the DFA, and for longer sequences, I repeated the surrogates at the top, but you can also have four-byte overlong sequences and then of course you can have byte sequences which just take you out of range on the high end. So, they're there for reference if anybody wants to look at them.

But at the end of the day, the DFA that you get looks like this. It has nine states. It has states that account for single-byte sequences, so ASCII characters are just transitions from the begin state back to itself, and the begin state is also the accepting state. If you go from begin and you get C2 to DF, you go all the way over to continuation state one. If you then get a valid trailing byte, you come all the way back to begin. That represents a valid two-octet sequence. Then we have partial three-byte states A and B, near the top there, which feed into continuation state one and back, for three-byte sequences, and then we have a set of states here, partial four-byte states A and B, and continuation states three and two, which are the transitions that are necessary to recognize four-byte sequences.

So how can we use this in practice? Let's assume that we have a three-byte input sequence, which you see there at the top of the slide: E2, 88 and 85. We start in the begin state; we look at our lookahead, which is E2. We have an outgoing transition on E2, which I've highlighted in yellow here, which leads to continuation state two. So we consume that byte and we go to continuation state two. Our lookahead is now 88. We have an outgoing transition that matches 88, which takes us to continuation state one. Our lookahead is now 85.
We have an outgoing transition that matches that, and it takes us back to the begin state, our accepting state. So, boom, we've just recognized a three-byte, or three-octet, input sequence of UTF-8. Very straightforward. And what I did not say is that every other outgoing transition from these states implicitly goes to the error state at the lower right-hand corner, which is also an accepting state. But there are a lot more of those transitions, and therefore they are not on the graph.

So how can we write a converter to do this? Well, clearly the idea is to do recognition and decoding with a table-based DFA, and importantly, we want to do the decoding at the same time we're doing the recognition, rather than recognizing and then going back and decoding. We want to pre-compute as much as possible to get the highest possible performance, but we want to keep our tables small because I imagine this might be useful in some sort of embedded application. I'd like to keep the code as simple as possible but also make it fast. I'd like to hide all of the complexity of the recognition in the DFA tables rather than in the code, and of course, I'd like to be faster than the other guys.

So, I'm gonna show you some code, and there are assumptions associated with the code. First of all, there is no error checking in this. I'm assuming that all your pointer arguments are non-null. This is, in a sense, an academic exercise. I'm assuming that the input and output buffers that your pointers point to actually exist. I'm also going to assume that the destination buffer is large enough that it can receive any output with no overflow. I'm not checking for that in this code. I'm also assuming that the destination code points are little-endian, which they would be on an Intel architecture where we have SSE. I'm assuming that we're going to use Intel hardware and that at least SSE2 is available on that hardware. Finally, I'm going to assume that the destination code point buffer is aligned on a char32_t boundary, which is necessary for the SSE code I'm gonna show you to work, and if I'm transcoding to char16_t, I'm gonna assume that the output buffer is similarly aligned on a char16_t boundary.

So here's what the public interface for my decoder looks like. It's a traits-style class. Pretty much everything in it is static. Of interest are these three functions: basic convert, fast convert and SSE convert. They take char8_t input params and a char32_t pointer as an output param. That's the output buffer. This ordering and style of arguments was chosen to mirror std::copy.

In building the state transition table I'm gonna define a couple of enums so that I can make the table very small. The first is a character class enumeration. By examining the DFA and the tables that I showed you earlier, one can infer that there are 12 distinct character classes, ranging from illegal octets, to ASCII, to continuation-range bytes, to leading bytes for two-byte, three-byte, and four-byte sequences. This is a decomposition of all those transitions into the minimum number of classes that are capable of representing the state transitions in the DFA. I've also got a scoped enumeration which represents the states themselves. The DFA has 12 classes of input and nine states, and the states map directly to the states that you saw in the state transition diagram.
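Here is a hedged reconstruction of the shape of those two enumerations. The names and the exact breakdown of the classes are my guesses at the structure just described, not the library's definitions:

#include <cstdint>

// Assumed reconstruction, not the library's code: 12 character classes.
enum class CharClass : std::uint8_t
{
    ILL = 0,                 // illegal octets
    ASC,                     // ASCII
    CR1, CR2, CR3,           // continuation-byte ranges
    L2A,                     // leading byte of a 2-byte sequence
    L3A, L3B, L3C,           // leading bytes of 3-byte sequences
    L4A, L4B, L4C,           // leading bytes of 4-byte sequences
};

// Nine states, spaced by 12 so that (state + class) indexes directly into a
// flat 9 x 12 = 108-entry transition array.
enum class State : std::uint8_t
{
    BGN = 0,                 // begin (also the accept state)
    ERR = 12,                // error
    CS1 = 24, CS2 = 36, CS3 = 48,   // continuation states
    P3A = 60, P3B = 72,             // partial 3-byte states
    P4A = 84, P4B = 96,             // partial 4-byte states
};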
You may wonder why the states are incremented in units of 12. That's because there are 12 character classes and I'm using a linear array, rather than a two-dimensional array, to do the lookup when I do the state transitions. You'll see why in a moment. I've also got some convenience definitions at the bottom to make the state transition table a little easier to read.

In terms of data structures, there is a data structure I call first unit info. This is a little structure that represents special information about the first octet in a sequence. The first octet has to be treated specially; there are things that you do with the first octet that you don't do with the trailing octets. So I pre-compute what the initial value of the code point is, the bits that would be inserted into the code point as you begin the calculation. I also pre-compute what the next state is, given this input symbol, this octet. I also have a set of lookup tables. There are 256 possible octets, right? So there are 256 possible first units. So there's a table that represents what to do with each of those possible first units. There is also a table which maps each octet into one of the 12 character classes that I showed you before. And then finally there is the transition table itself, which represents the DFA. So I have a static set of lookup tables. I also have some static member functions which advance through the DFA, a special function which converts runs of ASCII characters using SSE, and another function for getting the trailing zeros in a 32-bit word.

So, just as a quick example of those tables, here's an excerpt from the first unit table. You can see in the left-hand column in blue, these are the bits, the sequence of bits, which would be masked into a code point, based on their hex values, which are in the right-hand column in green. So for example, if the leading octet was hex C3, here at the bottom of the second group, then the sequence of bits which actually get masked into the code point are three, and my next state in the state transition would be continuation state one, because hex C3 is the first byte in a two-byte sequence. So by looking up in the table, I immediately know what my bits are that go into the code point, and I also know what my next state transition is.

And then, as I mentioned, the second table maps my input octets to the set of 12 character classes. Obviously, the first half of the table maps to ASCII. In the range from hex 80 to hex BF are the continuation-range bytes. And then finally at the high end, in red, these are bytes which can never appear in a valid UTF-8 sequence. And then, in the light blue, medium blue and dark blue, we have the categorization into character classes for leading bytes of two-byte sequences, three-byte sequences and four-byte sequences. So I can take any of the 256 possible octets, and given this table, find out what the appropriate character class for that octet is. When I do, I get a DFA that looks like this. What I've done is, in gray are what you saw previously, the ranges of octets on those edges, and in black are the character classes. So, 12 character classes, nine states. Put it all together and you get a transition table, a DFA, that looks like this. Highlighted in yellow are all of the valid entries in this table.
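A sketch of the data-structure shapes just described, with assumed names; the real library's layout and naming may differ:

#include <cstdint>

// Assumed names; a sketch of the shapes described above, not the library's code.
struct FirstUnitInfo
{
    char32_t     initial_bits;   // pre-computed bits the leading octet contributes
    std::uint8_t next_state;     // DFA state entered after consuming that octet
};

// One entry for each of the 256 possible leading octets.
extern FirstUnitInfo const first_unit_table[256];

// Maps every possible octet to one of the 12 character classes.
extern std::uint8_t const octet_class_table[256];

// Flat 9 x 12 transition table: because states are pre-multiplied by 12,
// the next state is simply transition_table[state + character_class].
extern std::uint8_t const transition_table[108];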
And so in a properly formed UTF-8 sequence, we'll start in the begin state, we'll accept any of the transitions you see there on the right, or if it's ASCII we'll come right back to the begin state, and we'll work our way through this table. If we ever get input that takes us out of those yellow entries, it becomes an error, which we've marked, and that's why I used lowercase, to make it easier to read the table.

So, how do we do the conversion? Well, here's the basic conversion algorithm for UTF-8 to UTF-32. I've got a couple of temporary or working variables at the top, and what I want to do is, while my source pointer is not equal to my end pointer, just as if you were doing something with iterators, I want to take my input byte and advance through the DFA with this function that I call Advance, and Advance returns the ending state. If the ending state is not equal to the error state, I'm gonna assign the code point to some output element in the destination.

So what does Advance look like? Whoops, sorry. This is sort of a map of Advance. We'll go through the pieces in more detail. At the top we have some working variables. We will grab the first unit info based on the octet. This represents the current code unit, this represents the code unit's character class, and that's the current DFA state.

So, let's go back to our example. Suppose we had this input sequence: E2, 88 and 85. Our code point value is zero. This is our working value; this is what we're going to build up, and we are at this part in the state machine. So I'm going to get the first code unit and look it up to get the first unit info descriptor. I'm then going to find out what the initial code point value is that goes with this input octet, and what the next state is. In this case, this is a three-byte sequence. So I'm going to use the lower four bits of the first octet you see there in the second line, I'm going to put them in the bottom of the code point, and I'm going to transition to continuation state two.

Now I'm at the top of my loop. Now I'm going to loop over the continuation bytes. So I'm going to get my next code unit and advance my input pointer. I'm going to take those bits that I put into the bottom of the code point and shift them to the left by six bits, because trailing bytes can only contain six bits of information. I'm gonna take the trailing byte, mask off the lower six bits, and bitwise-OR them into the code point. I'm then going to figure out what the character class is for this code unit that I just read, and what the state is corresponding to my current state and the character class that I just read. So I'm taking the state, which is pre-multiplied by 12, which represents an offset into that array. I'm adding the character class to it, which gives me a specific offset into that long linear table that gets me to my next state. You can see in the middle column there, I've taken 88, a continuation byte, and I've pulled six bits out of it. I've taken my original four bits, shifted them to the left by six, and masked in those six, and that takes me to continuation state one, where I will repeat the process and get a picture that looks like this. This would be my output code point, because I have reached the accept state for the sequence. I'm done. I will return curr, which is the state that I ended on, which in this case is the begin state, which is also the accept state.
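Putting that walkthrough into code, here is a hedged sketch of the advance step and the basic conversion loop. It leans on the assumed table declarations from the earlier sketch, and the numeric state values (BGN = 0 as the accept state, ERR = 12, continuation and partial states above 12) are my assumptions, not the library's:

#include <cstdint>

// Sketch only: one pass through the DFA, decoding while recognizing.
// Relies on the FirstUnitInfo / table declarations sketched earlier.
inline std::uint8_t advance(std::uint8_t const*& p, std::uint8_t const* end, char32_t& cp)
{
    FirstUnitInfo const info = first_unit_table[*p++];
    cp = info.initial_bits;                        // pre-computed leading bits
    std::uint8_t state = info.next_state;

    while (state > 12 && p != end)                 // >12 means "still expecting continuation bytes"
    {
        std::uint8_t const unit = *p++;
        cp = (cp << 6) | (unit & 0x3Fu);           // fold in six payload bits
        std::uint8_t const cls = octet_class_table[unit];
        state = transition_table[state + cls];     // states are pre-multiplied by 12
    }
    return state;                                  // 0 (accept) on success, 12 (error) on failure
}

// The basic conversion loop from the slide, in the same sketch style.
inline char32_t* basic_convert(std::uint8_t const* src, std::uint8_t const* end, char32_t* dst)
{
    char32_t cp;
    while (src != end)
        if (advance(src, end, cp) == 0)            // reached the accept state
            *dst++ = cp;
    return dst;
}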
And take this code point: I've only shown you the lower 16 bits of the code point here, but it's actually a 32-bit integer; there are 16 zeros to the left. This maps to hex 2205, which is the character representing the empty set. I'm then going to take that code point, assign it to my output buffer, and advance.

All right. So how fast is this? Is it useful? Here's a benchmark. I'll get into more detail later about what the benchmarks are and what they mean, but this is supposed to be sort of a gut check: is it worth continuing? I've got some Wikipedia pages: the Wikipedia page that describes the English language in English, the Wikipedia page that describes the Chinese language, Mandarin, in Mandarin, and the Wikipedia page that describes the Hindi language in Hindi. I've also got several commonly used libraries for doing conversions. I've got iconv, which is a library from the GNU distribution, which is very popular on Linux and UNIX and actually runs on everything. I've got the LLVM converter, taken directly out of the LLVM distribution. AV is a converter written by a gentleman named Alexey Vatchenko. I've got std::codecvt in yellow there. The light blue is Boost.Text. Green is BH; this is a DFA-based converter written by a gentleman named Bjoern Hoehrmann, and I should say that after I had gone through all the effort of working out the DFA, it occurred to me that maybe I could've looked it up on the internet, and I found that someone had actually derived the same DFA about 10 years before I did. So my discovery was 10 years too late. But you can see, and Boost.Text is a very highly optimized converter, that in the case of English my performance is very close, about 4% worse. But in the case of Chinese and Hindi, I got substantially better performance than anything else. And by the way, this is a benchmark running on Linux with GCC 7.2.

So, can I make it faster? All right, let's go back and look at our original basic algorithm. If you look at this algorithm, what's the first thing that stands out at you? I look at this, and then I look at the next step, which is to go into Advance. What if the first octet is an ASCII octet? Look at all the work I have to do just to figure out that I'm gonna go into the state machine in the begin state and come back to the begin state immediately. I've done a lot of extra work just to recognize an ASCII octet. So the obvious optimization is to check the leading octet to see if it's ASCII. If the leading octet is ASCII, I can immediately just assign it to the output code point and be done, loop back around and see what I need to do next, and if it's not ASCII, only at that point do I drop into Advance and try to recognize it through the DFA.

Do I get any benefit? Well, yeah, I get some. I get a real nice benefit on English; actually I get a very nice benefit on all three of these, and in fact if you look at it, just a simple converter, it is about four times faster than iconv with English, about three times faster than iconv with Chinese, maybe three or four times faster with Hindi, and it's faster than everything else. Very reasonable approach. Very simple. This is all doable with standard C++ now. It could be done with C++98. In fact, the code that I've just shown you could be mechanically translated to C and used. There's nothing special about it. As I said, lots of code, but very simple code.
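A sketch of that ASCII-check optimization, under the same assumptions as the earlier sketches:

#include <cstdint>

// Sketch of the ASCII-optimized loop: test the leading octet first and only
// fall into the DFA (the advance() sketch above) for multi-byte sequences.
inline char32_t* fast_convert(std::uint8_t const* src, std::uint8_t const* end, char32_t* dst)
{
    char32_t cp;
    while (src != end)
    {
        if (*src < 0x80)                       // ASCII: assign directly, no DFA
            *dst++ = *src++;
        else if (advance(src, end, cp) == 0)   // 0 == assumed accept state
            *dst++ = cp;
    }
    return dst;
}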
All the complexity is hidden in the tables, which is nice, except if you're the guy who has to figure out the tables. Okay. (speaking faintly) On the order of anywhere from a third to a half. There are a lot of ASCII characters in web pages. But I do have some torture tests where I have 100% three-byte sequences, and 50% three-byte sequences. So let me keep going, because I have a lot of slides to get through.

Okay, can I go faster? I hope I can go faster. Let's look at our ASCII-optimized algorithm again. If I look at this, I look for opportunities for optimization. Now I'm looking at my if statement, which is checking for ASCII, and the assignment to a 32-bit code point if it is ASCII. What if I could make that faster? What if I could use 16-byte SSE registers and convert runs, sequences of ASCII characters, in one shot? Would that get me any speedup? As it turns out, you can.

What I'm going to do is bifurcate the algorithm into two halves. The top half of the algorithm is going to scan the input, everything up to one SSE register's worth of input at the very end of the buffer. I don't want to be in a situation where I have eight bytes left but I try to read 16 bytes, past the end of the buffer, into an SSE register. That's why, for the last 15 or fewer bytes, I don't use this; I drop down to the second half. But for all of the bytes before the last 15, if the first byte that I detect is ASCII, I'm gonna drop into this inline function that I call ConvertAsciiWithSse. Otherwise, if it's not an ASCII character, just like before I'm gonna drop into the Advance algorithm and work through the DFA. The bottom half of this function is just the basic algorithm that I showed you before. The bottom half occurs when I have 15 or fewer characters left in the buffer. I can't use SSE, so I just use the ASCII-optimized algorithm that we just saw.

So the question is, what does ConvertAsciiWithSse look like? Well, here's an overview. It's actually quite simple. I'm gonna work through an example with this and, just like before, go line by line. So, let's start with an example. Here at the top I have 16 code points, a Greek word. I don't know how to pronounce it, but it's kappa, omicron, sigma, mu, epsilon, and it maps out to these 22 code units. I'm going to read the first 16 code units, which are in the darker gray at the bottom, into my SSE register, and if you count from the left, the first 11 are ASCII. The 12th one, which is CE, is not ASCII. So keep that number in your mind: 11.

All right, so I'm gonna start. I'm gonna define some convenience variables. Here are some SSE structures. I think the compiler actually treats them as registers, so I just, kind of sideways, call them registers here. I also have a couple of integers: one for a bit mask and one which is the increment for advancement, and we'll see what that means in a moment. So the first thing I'm going to do is take a register that I call zero, or the interleave register; this is gonna be 16 bytes of zero. There you go, there's a register called zero. I'm zeroing it, setting it to 16 zero bytes. The next thing I'm going to do is an unaligned load of 16 bytes into an SSE register, and I'm gonna call that register chunk. That's my chunk of data. All right, so here's my pointer to memory, my source buffer.
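A sketch of that two-part driver, under the same assumptions as before; the helper names are mine, and convert_ascii_with_sse is sketched after the next section:

#include <cstdint>

// Forward declaration of the SSE helper sketched below.
void convert_ascii_with_sse(std::uint8_t const*& src, char32_t*& dst);

// Sketch of the bifurcated driver: use the SSE path only while a full
// 16-byte register's worth of input remains, then finish the last 15 or
// fewer bytes with the scalar ASCII-optimized loop.
inline char32_t* sse_convert(std::uint8_t const* src, std::uint8_t const* end, char32_t* dst)
{
    char32_t cp;

    while (end - src >= 16)                        // top half: safe to load 16 bytes
    {
        if (*src < 0x80)
            convert_ascii_with_sse(src, dst);      // advances both pointers
        else if (advance(src, end, cp) == 0)
            *dst++ = cp;
    }
    while (src != end)                             // bottom half: scalar tail
    {
        if (*src < 0x80)
            *dst++ = *src++;
        else if (advance(src, end, cp) == 0)
            *dst++ = cp;
    }
    return dst;
}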
I called this function and I've loaded the data into the register that I called chunk. The next thing I'm gonna do is call a function called movemask on the chunk register. What this does is compute a mask in the lower 16 bits of an integer. Oh, and I should say that on these diagrams, the least significant byte and least significant bit are on the left, and the most significant byte and most significant bit are on the right. So this intrinsic computes a mask by looking for octets in the register that have their highest bit set. If the highest bit is set, it is not an ASCII character. So in computing this mask, you can see that the lower 11 bits in the mask are zero, meaning the lower 11 bytes in the chunk are ASCII, and the upper five bits are set to one, because the upper five bytes in that register are greater than ASCII. Okay. This will be important.

Okay, so now I'm going to use SSE intrinsics to do a zero extension. I'm going to do some unpacking and interleaving. So I call a function called unpacklo_epi8, and what this is going to do is take the lower eight octets from the chunk register and interleave them with the lower eight octets from the zero register. Here's the result in the register that I call half. So now I've taken the bytes and zero-extended them to be words, 16-bit words, in the half register. I'm then gonna take those and repeat the process, calling a function called unpacklo_epi16. Here I'm taking the lower eight bytes from the half register and zero-extending them again into a register that I call quarter. So what I've effectively done is taken the lower four bytes in my original chunk and zero-extended them to 32 bits each, using SSE. And I'm going to repeat the same process for the upper eight bytes in the half register. First I'm going to write my results to memory, though, and then I'm going to unpack and interleave again. So I unpack and interleave the upper eight bytes that were in the half register into the quarter register. I'm going to write that again. So I've just shoved 16 more bytes out into memory, and I'm going to repeat this process of interleaving and writing for the upper eight bytes in the original register that I call chunk. So I've now taken these 16 bytes, interleaved and zero-extended them, and written them out into memory as 64 bytes.

Now I have to figure out how far to advance my pointers: how far do I advance my input pointer and my output pointer? I have a function called GetTrailingZeros. What does GetTrailingZeros do? It's going to use that mask that we computed up at the top. On Linux and UNIX, GCC and Clang have a builtin, ctz, which counts the trailing zeros in a word. On Windows there is BitScanForward. So what I do is take my mask and compute my increment, my advancement, by counting the trailing zeros in that mask. And remember, low is on the left and high is on the right in these diagrams, so the trailing zeros are the ones on the left. So I've counted my trailing zeros. I have 11 of them, and I wrote the word eleven because I realized 11 is one-one, and that could also be interpreted as three. Okay, so I know I can safely advance my source pointer by 11, which takes me to CE, which is a non-ASCII byte, and I can also safely advance my destination pointer, my output pointer, to a new location in memory.
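Here is a hedged sketch of that whole sequence with actual SSE2 intrinsics. The structure follows the walkthrough above, but the function and helper names are assumptions of mine, not the library's, and I use unaligned stores to keep the sketch simple:

#include <cstdint>
#include <emmintrin.h>     // SSE2 intrinsics
#if defined(_MSC_VER)
#include <intrin.h>        // _BitScanForward
#endif

// Portable count-trailing-zeros, as described: __builtin_ctz on GCC/Clang,
// _BitScanForward on MSVC. The caller must not pass zero.
inline int trailing_zeros(std::uint32_t mask)
{
#if defined(_MSC_VER)
    unsigned long index;
    _BitScanForward(&index, mask);
    return int(index);
#else
    return __builtin_ctz(mask);
#endif
}

// Sketch of the SSE ASCII-run converter: zero-extend 16 input bytes to 16
// dwords via unpack/interleave, store all of them, then advance both pointers
// by the length of the leading ASCII run found with movemask.
inline void convert_ascii_with_sse(std::uint8_t const*& src, char32_t*& dst)
{
    __m128i const zero  = _mm_setzero_si128();
    __m128i const chunk = _mm_loadu_si128(reinterpret_cast<__m128i const*>(src));

    // Bit i of the mask is set when byte i has its high bit set (non-ASCII).
    std::uint32_t const mask = std::uint32_t(_mm_movemask_epi8(chunk));

    __m128i half = _mm_unpacklo_epi8(chunk, zero);     // bytes 0..7  -> 16-bit words
    __m128i qrtr = _mm_unpacklo_epi16(half, zero);     // words 0..3  -> 32-bit dwords
    _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + 0), qrtr);
    qrtr = _mm_unpackhi_epi16(half, zero);             // words 4..7
    _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + 4), qrtr);

    half = _mm_unpackhi_epi8(chunk, zero);             // bytes 8..15 -> 16-bit words
    qrtr = _mm_unpacklo_epi16(half, zero);             // words 8..11
    _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + 8), qrtr);
    qrtr = _mm_unpackhi_epi16(half, zero);             // words 12..15
    _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + 12), qrtr);

    // Advance only past the leading ASCII run; the extra dwords written past
    // it are harmless because the output buffer is assumed to be oversized.
    int const run = (mask == 0) ? 16 : trailing_zeros(mask);
    src += run;
    dst += run;
}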
Now you may say, well, you wrote five dwords to memory that you didn't need to. You're right, I did. But it's also faster than trying to be smart and only write the number of words that you need. So, going back to my algorithm, I've done my advancement, and I fall through to the point where I'm now at a non-ASCII byte. Remember, it was CE. Now I'm gonna fall into the DFA and continue from there.

Is it fast? I like to think it's fast. I got even better results. Now I'm on the order of five or six times faster than iconv in all of these cases, and faster than everybody else by a good margin, for these three test cases, which are basic one-byte and three-byte cases. I'm hearing myself. (group laughing)

All right, so now I'd like to show you some benchmarks, not just this one; I promised that I would explain them. I did the benchmarking on Ubuntu, running in a VM, actually on this laptop right here. I ran the benchmarks with GCC 7.2 and Clang 5.0.1. I'll show you the GCC benchmarks; the Clang benchmarks are very similar, no surprises there. I'm compiling for Westmere, which is a fairly old architecture that supports SSE2. Actually it's very old; it's eight or nine years old by now. I also compiled with Visual Studio 15.4.4 on this laptop, with what I think is the best combination of flags for optimization.

My input data: I had several input files which were taken directly from wikipedia.org. As I mentioned, what these represent is the Wikipedia description of a language, in that language. (speaking faintly) The HTML page. The HTML page. I have stress_test_0, which is 100,000 ASCII code points, which is not really much of a stress test, but it's intended to be a baseline. I have stress_test_1, which is 100,000 Chinese code points, which are all three-byte sequences, so that's 300,000 input code units. And I have stress_test_2, which is 50,000 Chinese code points interleaved with 50,000 ASCII code points. So it'd be Chinese code point, ASCII code point, Chinese code point, ASCII code point. And if you do the math, that works out to 200,000 code units.

My reference libraries are the ones I mentioned before: iconv, LLVM, Alexey Vatchenko's work, std::codecvt on those platforms, Boost.Text compiled on those platforms, Bjoern Hoehrmann's code, and on Windows, since I thought there might be some local people here, a comparison against MultiByteToWideChar from the Win32 API. The testing methodology is as you might expect: read the input file, create an oversized output buffer (remember, an assumption was that the buffer was large enough to receive the output with no overflow), start a timer, enter the timing loop, and perform the conversion of the input buffer multiple times. Now, these test files that I mentioned are not all the same size. They range anywhere from 60 kilobytes to about 220 kilobytes in size. So in order to keep things more or less apples to apples, I repeated the test for each one such that a total of one gigabyte of input was processed. A one-megabyte input file would be processed a thousand times; a 100-kilobyte input file would be processed 10,000 times. So at the end of the day, I read in one billion code units. Then exit the timing loop, of course, stop the timer, collect and collate the results. To pass, every library had to agree with the output from iconv. In fact, all of the libraries had to agree with each other.
That was my criterion for making sure that I was right, and that the libraries themselves were self-consistent.

So, results on Linux. You just saw these; this is the slide I just showed you for English, Chinese and Hindi. We see very similar performance results for Portuguese, which is a lot of two-byte sequences, Russian, which has a lot of three-byte sequences in it, and Swedish, which is two-byte sequences. For stress_test_0, which is 100% ASCII, you see some very nice performance numbers. I mean, it's almost eight times as fast. Now here's where we see some interesting behavior, for stress_test_1. Remember, stress_test_1 is 100% Chinese code points. All three of my algorithms do very well. But it's interesting to see that the basic algorithm does the best, by about 2%. It does the best because it's only going through the DFA; it's never doing that additional check to see if the first character is ASCII. The check-if-ASCII version, the sort of red bar there, is about 2% slower, which is interesting because it gives you an idea of what the cost of that branch is in running code. And finally, the SSE-based approach has very similar performance, because it's looking to see if the first character is ASCII as well. In this test the ASCII branches are never taken, because it's 100% three-unit sequences.

Stress_test_2 is also kind of interesting. That was the test where we had a Chinese code point, a three-octet sequence, followed by ASCII, repeated 50,000 times. In that case we see slightly better performance out of kewb-fast, which is the ASCII-optimized algorithm, because it finds that one ASCII byte that's interleaved between the two Chinese code points. But the interesting thing is to look at the SSE. This torture test kills the SSE algorithm, because it sees a single ASCII character and then does all of that work and writes all of that, just to come back and realize it only needed to write one byte. But still, even in that case, the SSE-optimized algorithm does pretty decently compared to the others.

You can also transcode to UTF-16. I didn't show that code; it's very straightforward. Here are some similar results on the same platform, GCC 7.2 on Ubuntu 18.04. Very similar performance characteristics. Here, Alexey Vatchenko did not have code in his library to transcode to UTF-16, which is why there are those empty slots, but all the other libraries worked. This was all run on the same machine, and the numbers are actually a little bit lower, at least for the SSE version, because there are fewer writes to memory. In fact, in the SSE algorithm for zero-extending to 16 bits, there are half as many interleaves that need to be done and half as many writes, which is why the performance is better. Similar performance characteristics as you saw before, and the shape of these bars is about the same for stress_test_1 and stress_test_2, although interestingly the basic algorithm and the SSE algorithm are neck and neck for stress_test_2, the interleaved one. I don't have a good explanation for that.

Okay, so we have some local people here, and I thought looking at Windows performance would be exciting and entertaining. So here's the conversion to UTF-32. We got very nice results with SSE through the Windows compiler. I mean, on an absolute basis they were better than using GCC running in a VM.
A VM running on a quiescent system, where I wouldn't expect there was a lot of system overhead going on. Very similar shapes to the graphs. No real surprises, well, except for Boost.Text, and Zach Laine is the author of Boost.Text, and he and I are a little puzzled about what's going on here. But the performance characteristics were very similar to what we saw with UTF-32.

So, for UTF-16 I wanted to see: there is this Win32 API, MultiByteToWideChar, and it's used in all the system function calls. In the Win32 API, whenever an ASCII string is fed as an input there's an internal conversion that goes on, and everything is done as UTF-16 internally. So Microsoft has this API, MultiByteToWideChar, which has been really heavily optimized over the years, and I wanted to see: could I do better than that, at least on these test cases? That was my benchmark for success. So, here in the third slot, I replaced Alexey Vatchenko's benchmarks, in gray, with Win32 MultiByteToWideChar. In the English case, we do a little better. In the Chinese and Hindi cases, we do a little bit better. We saw similar performance improvements for Portuguese and Swedish, which are two-byte cases, and also for Russian, which has a lot of three-byte sequences in it. And then finally, on the torture tests: for stress_test_0, which is no torture at all, I did better. For stress_test_1, which was 100% Chinese, all three of my algorithms were very close, while the basic and the SSE were very close to MultiByteToWideChar there with 983. And for stress_test_2, I beat Microsoft in two out of three. So I'm gonna put a mark in the win column for that.

So, thinking about reusing the library: within the next few weeks I'll put a permissive license on this and make the code open source. But I just want to remind you that the error handling in this was intentionally limited. The interface is intentionally small; I was doing this as a proof of concept. I actually have, if you look at the code, two different table-based advance algorithms. I showed you the algorithm for the big table, which requires 896 bytes and 14 cache lines. There's also a small-table version, which is a little bit slower but can fit into 380 bytes, or six cache lines. I imagine this actually being reused either as a library, or, really, I imagine people just taking it and cutting and pasting it into their own code, because they will be better able to optimize it than I can by providing a library. There's only a trivial mechanism for reporting errors. There's no checking done, as I said, for null pointers. And these are the requirements, the caveats, that I mentioned before.

In terms of future directions, I've been doing a little research, and it seems like AVX2 and AVX-512 will allow me to do faster zero extension with fewer intrinsics calls. So I'm gonna experiment with that in the future. I actually don't have access yet to anything that has AVX-512 on it. All of the conversions right now are to little-endian, but one could imagine using intrinsics to flip them to big-endian if that's what the consumer desires on the output end. There should be a validate method which measures length, kind of like strlen, except for UTF-8 sequences. I'd like to provide member function templates that take iterators, rather than using pointers as the interface does now. Input and output iterators could probably be used for the non-error-handling and ASCII-optimized algorithms.
Input and output iterators, since they're not random access, I don't think I could rely on being able to use them with SSE. And yeah, I just made that point. Okay. I think there should be four-argument versions, so that you can do checking of your destination buffer to make sure that you've got enough room, with error checking for out-of-bounds writes. I'd like to provide meaningful error reporting, as I just mentioned: the type of error and where it occurred relative to the input. And I'd also like to provide some commonly used mechanisms for error recovery. Some of the most common ones are to just stop and return an error or throw an exception immediately; to entirely skip defective ranges of code units until you find a sequence of code units that's valid; or to replace, in the output, defective ranges of code units with some representative code point that represents an error. If you've ever seen the little diamond with a question mark in it, that's where somebody substituted invalid UTF-8 with this character, which represents that in the output.

So in summary: sometimes it pays to re-examine the algorithms and data structures that we use from a different perspective. I spent a lot of time fooling around with these algorithms, changing things around, trying to optimize them, and at the end of the day I realized it's very hard to outsmart the compiler. It's usually pretty smart about stuff. But in doing optimization it's very important to build your benchmarks and test a lot. Test with multiple compilers, because I would do things that would get me better performance with GCC and would make performance worse with, say, Visual Studio or Clang. And do it on multiple operating systems if you can, and on multiple hardware platforms. I actually generated the results, the graphs, that you saw from this laptop, but I ran the tests on four or five different Intel chips, and I saw very similar performance across all of them. And finally, last point: savor your victories. It's not very often you do something easy and you catch a big fish. So enjoy it when you can. I'm going to put the talk up today, the slides and the code, up on GitHub, and there's a link on my blog. So, thank you. Any questions? (audience clapping)

- [Man] Hi, great stuff. - Thank you. - [Man] Since you're handling ASCII out of band, not as part of your table, did you consider taking those transitions out of your table, and would it make that part go faster? - That's a good question. If I had taken them out of the table I would reduce from 14 cache lines to 12 cache lines. Would that have made a difference? I don't really know, because all of it as it is now is 896 bytes, in terms of cache lines. I did not try it. I didn't think about that. It would reduce the table size. My intuition tells me it would have zero effect on performance, but it's a good experiment to think about trying. - [Man] I've been messing around a little bit with a lexer generator library and tool, and one of the learnings from that was that if you expose the transition table to the compiler in the form of code, instead of data as a table, sometimes it can produce faster code, because the optimizer sees the logic, you know, and I was wondering if you'd considered experimenting with that, with maybe one of the Boost state machine libraries or something. - Not until this very moment. - [Man] Okay. - But that's a good suggestion as well. Thank you.
- [Man] Along those lines, constexpr might also help, in terms of when it can do aggressive optimization. - Yes, Jason Turner has chastised me continuously for not making everything in this constexpr. - [Man] By the way, the Greek word is Kosmeme. It means, "We Are." - Kosmeme. Okay. Thank you. Any other questions? - [Man] You said you played around with making the SSE version do something different when there are a lot of non-ASCII chars in it. Can you go into more detail about what you tried and how much worse it made things for the ASCII case? - I don't remember what I said. I did try a couple of experiments to avoid going into the SSE, to avoid using SSE. One was to look and see if I had two ASCII characters in a row, rather than one, and surprisingly the extra branch for doing that test didn't improve things; it actually degraded performance. As for the other experiments: I did experiments with SSE, but since I chose SSE2 as my base, my foundation for the intrinsics that I would support, there were very limited intrinsics for doing the zero extension. Basically I had to do this silly zero register and interleave. When you get to AVX and AVX2, there are intrinsics you can use that just do it for you in a single intrinsic, which is why I think that's the next thing I'll experiment with, because I could do the zero extension much more quickly. If you noticed, there was a big difference in performance between the UTF-32 and UTF-16 versions, because there were half as many intrinsics calls and half as many writes to memory for 16 as there are for 32. I think in the case of the AVX2 and AVX-512 intrinsics, I might be able to halve that again, and with AVX-512 you can do even more at once. Oh, and in terms of, I think I know what you're getting at. So what I played with was, give me just a second and I'll get to that slide. I think I know what you're asking. - [Man] And since you mentioned it, the AVX-512 actually worried me: if you go to 512, your stress test might get a little worse, 'cause it's a bigger block size, and it'll do more work that's thrown away. - Right. Is that it? Okay, here, at this part at the bottom of the code in the SSE conversion, this is what I played with. Now, you see that there is a branch there. Doing this branch, looking for a zero mask and incrementing by the literals, is faster than always getting the trailing zeros from the mask, computing an increment, and then adding that in. So this extra branch, so that I can increment by the literals in the case where I have 16 ASCII characters, actually gives me a nice three or four percent performance boost in this case. I did play with, for example, the placement of things; I first tried to do the get-trailing-zeros outside of this if statement, which was a cost I didn't have to pay every time, and experimenting with this led me to the discovery that incrementing by literals is very quick. That was one of the things that I played with. That may have been what you were getting at. The placement of things, the order of operations inside, the order of those SSE operations. So the order that I have is, you know, sort of like simulated annealing: I think I'm close to the global, or a local, minimum, but I'm not quite sure. - [Man] Have you tried to do the DFA itself vectorized? - So, there are people that have experimented with vectorizing DFAs, vectorizing DFAs to do recognition.
But I wanted to do more than that: once you do the recognition, you still have to build the code point, and I've not yet figured out how one can build the code point in a vectorized fashion at the same time as you're doing the recognition. - [Man] We usually leave things in UTF-8, so just the recognition would be valuable. - Yeah, well, doing the recognition with a vectorized DFA is probably possible. - [Man] Thank you. - Yeah. - [Man] This is more of a comment related to the material. If you haven't read it, everybody in the room should read Joel Spolsky's 2003 article, The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets. You really should read that, if you haven't. - Okay. All right, anything else? All right. Thank you very much for coming. Hope you got something out of it. (audience clapping)
Info
Channel: CppCon
Views: 10,607
Id: 5FQ87-Ecb-A
Length: 69min 40sec (4180 seconds)
Published: Thu Oct 11 2018