CppCon 2017: Carl Cook "When a Microsecond Is an Eternity: High Performance Trading Systems in C++"

Video Statistics and Information

Reddit Comments

This is the sort of stuff I love (not scamming the stock market). I really love optimising code to the instruction level. For my capstone school project I had to sample an analog-to-digital converter 5 million times per second. But my microprocessor was only 80 MHz (80 million instructions per second, if the CPU guessed all your branches right). Step 1 - buy a better compiler or use assembly, because the first thing the free C compiler did in an interrupt was push all 8 registers to the stack, then pop them at the end. There go more of the free instructions you had, right there. Writing and tuning that interrupt handler was one of the funnest parts of the project.

👍︎ 189 👤︎ u/browner87 📅︎ Oct 09 2017 🗫︎ replies

I think it's a bit worrying that a lot of smart people are putting their effort into things like this. Physicists, mathematicians, engineers, the like, are all hired up by finance, and programmers are working for dumb start ups or evil tech giants.

Do we really need this? Don't you think we should be putting our efforts towards more important things?

👍︎ 156 👤︎ u/[deleted] 📅︎ Oct 09 2017 🗫︎ replies

Irrespective of the application, there is a benefit to efficiency in terms of power, cooling, etc. Which is one reason why not all the work being done in this area is on the financial side -- think Facebook, Google, and others with very large computing needs, scientific calculations, weather forecasting, etc.

For another angle, see presentations by Bryce Adelstein Lelbach.

👍︎ 3 👤︎ u/WallStProg 📅︎ Oct 09 2017 🗫︎ replies

While I appreciate the technological challenge, this seems about as ethical as making weapons or drugs. Instead of killing people, you take their livelihood through financial trickery that really should be outright banned. The financial markets should in the end bring a net-gain to the economy, not be mostly vapourware that sucks out money to give to the top 0.1%. That is what they are supposed to be: Service providers.

When I loan you $10'000 so you can start a business, I'm doing something sensible through a financial instrument. You get access to resources, and I get a cut of the profits later. That is what makes the free market such a powerful tool. But when I do HFT, I'm just taking a cut from a trade that two other parties wanted to make by forcing myself into the middle. I think one could go so far as to make a reasonable argument that this is theft, nothing more, and that is why I think it should be completely banned (and technologically prevented, by adding a short delay to every trade: This would be irrelevant to anyone else, but utterly destroy this parasitic cancerous tumour of an "industry").

And don't give me that "they provide fluidity" horseshit pseudo-argument. The markets are exactly as fluid with or without spamming 1-cent trade arbitrage in the middle (which should be fucking obvious, right?). The fluidity that's argued for is just a measuring artefact.

As a programmer, I will not work for such companies unless it's that or not finding any other job at all (after all I'd rather be unethical than starve to death). Arguably anyone who can write this kind of tech could find a decent job instead of becoming a legal thief.

Edit: Every time I point this out, a couple of HF-traders show up and try to explain to me how I'm wrong. I get it, you don't like getting called out. Doesn't mean I'm not right. If there was any value in HFT, even a non-trader could easily tell me what that value is, just like I can tell you what the value of even the most complex improvement in chip design is: "faster computers". One does not need a PhD to understand the value. But if there is no value, of course nobody can understand it, and of course people who make a ton of money off it will invent some impossible to fact-check explanation that sounds good.

Edit2: Here someone gives an explanation of how it works, trying to defend HFT: https://www.reddit.com/r/programming/comments/7542zx/cppcon_2017_carl_cook_when_a_microsecond_is_an/do49a1t/ Funnily enough it is literally the same example that I use in my second paragraph. Well fucking go figure: It's a parasitic man-in-the-middle, exactly as I said.

👍︎ 117 👤︎ u/chocolate_jellyfish 📅︎ Oct 08 2017 🗫︎ replies

Without reading the article... I interviewed for, and was actually offered, a job "inside the Chicago Loop" which I guess is prestigious, which would have had me solely responsible for running/maintaining/writing/upgrading software for high-performance, big-dollar trading, running on VAX/VMS systems. If I'd taken it, the salary in 1997 would have been more than I've ever made since, and undoubtedly would have grown from there. They guaranteed I'd make partner in eight years.

I turned them down, for three reasons:

1) Any downtime would cost the company tens of thousands of dollars per minute (or maybe per second, which would have been even worse), and I know myself well enough to know that I can't make good software fixes fast; I didn't want the responsibility.

2) I would have had to live in Chicago, very much not anywhere near the top of my list of places I'd like to live, at least not at that time. We went so far as to have a realtor take us around to look at houses, but nothing within our price range was even half as nice as what we were already living in in Rochester, NY for a fraction of the price; and at one house they apparently hadn't correctly communicated to the residents that we were coming to look at the house, because we opened a bedroom door and there were like five people sleeping in there. The whole house-touring experience just had a weird/creepy vibe.

3) I don't understand finance/trading AT ALL and would have been pins-and-needles the whole time.

👍︎ 8 👤︎ u/redweasel 📅︎ Oct 09 2017 🗫︎ replies

I really enjoy these types of talks. Thanks for sharing.

Does anyone have a list of talks/books that deal with this kind of low-level, performance-centric code in C/C++ (cache friendliness, inlining, access alignment, etc.)?

👍︎ 1 👤︎ u/moshohayeb 📅︎ Oct 10 2017 🗫︎ replies
Captions
- Cool so yeah, good afternoon everyone. My name is Carl, I work for a company called Optiver. I work writing automated trading systems and I've done that for about 10 years in total. So today's talk is just going to be basically a day in the life of a developer of high frequency trading systems. I'm also a member of SG14, so I try to give a little bit of feedback from the trading community into the standard — what's important from the low latency perspective in trading. But I'm definitely not a C++ expert. There's people in this room who will know 10 times more about C++ than me. It's not necessarily a problem though. You just need to be really good at measurement. You need to be really good at applying scientific method. You need to understand the tools. If you can do that and you can measure and you can measure well, you can write very very fast software without knowing the standard inside out. So I'm gonna give a very quick overview of what we do. Won't go into that in much detail, 'cuz that can take an hour or two just by itself. But I will talk about the technical challenges that we face. Those two topics are just introductory, then we get into the real stuff, which is the actual techniques that I use on a daily basis for writing fast C++. Mention some surprises, some war stories, and then finally I'm gonna discuss how to measure C++ and that's probably the most important few slides of the talk. It's really all about measurement if you want fast code. Cool. Another thing is that I would absolutely love to be able to discuss absolutely every optimization trick out there, but I can't. Just not enough time. I just did a run through of my slides this morning, threw away about half of them. So yeah, this is only a very very brief sample of optimization techniques for low latency. So low latency, not high throughput — not trying to process 60, 100 frames a second, but just trying to be very very fast when you need to be fast. So what do we do? Well most market makers are providing liquidity to the market. That basically means we're providing prices to the market to say we're willing to buy this instrument for a certain amount, we're willing to sell it for a certain amount more than that. By instruments I mean a tradable product: a stock, a future, an option. It's just pure coincidence that there's instruments on the table over there. Not those sorts of instruments. This has some challenges. You need to be moving your prices quite regularly. If you don't move your prices you'll miss out on the trade or even worse you'll get the trade but it wasn't what you were looking for 'cuz you didn't move your price fast enough and someone else has lifted you on that. So what market makers do is they try to just make the difference between buying and selling. So they want to buy a stock and sell it again. Sell it, buy it. That will hopefully balance itself out. So market makers aren't quantitative funds or hedge funds or anything like that. They're not looking to build a position with the intention that the market's going to go up or down, they're really just trying to make a small amount of profit by buying and selling fairly frequently. Ultimately this always comes down to two things: buying low, selling high. Yeah there's some complex maths in there. Sometimes it gets a little bit complex, but ultimately that's what all trading algorithms come down to. And success means being just slightly faster than your competitors.
It doesn't need to be by a second, it could be a picosecond, just as long as you're faster. I think this is quite a nice picture actually because it gives you an idea. No one really remembers who comes second or third or fourth in a race right? And that's kind of the same thing with electronic market making. You really need to be the guy out in front. But, safety first. A lot can go wrong in a very very short period of time. So you need to be able to detect errors and get out, not the other way around. You don't want to think about things then get out of the market, you want to get out then figure out what's gone wrong, because a lot can go wrong in a few seconds. That's best automated obviously. That's a good quote. Okay, so I'm gonna talk about the hotpath a lot, sometimes I'll say fastpath, but it's the same thing. This is the code which takes a message from the exchange, decodes it, figures out that this is an interesting event — this is a price change, this is a trade, this is something, a market open event. Executes an autotrading strategy, decides it wants to trade, runs the risk checks, makes sure that that's okay, and then finally sends an order to the exchange to say I'd like to trade or I'd like to change my price or something like that. That's the hotpath. Characteristics of the hotpath are quite interesting. Probably one to five percent of the total code base, and it gets executed very very infrequently. Maybe .01% of the time you'll actually have that hotpath fully execute. Now that's at odds with the operating system, the hardware, the network — basically everything in the room is at odds with trying to execute a hotpath not very frequently, but when it executes, very very fast, very little latency. So even networks for example are based on fairness, which makes sense. They want to batch up bytes until there's enough for the packet to be fully sent, then the packet's sent, and make sure that no one sender is saturating the network. The same with operating systems, they're all about multitasking and cooperative multitasking. These are things that you don't actually want from a low latency perspective. One final point: jitter. You do want to be fast most of the time, but sometimes you're slow. So if you're fast four times out of five, that's great, but if that fifth time you're so slow that you end up with a bad trade because you didn't move your price or you missed out — well if you miss out you miss an opportunity, but if you get hit on a bad order or bad trade, that can be quite expensive. So it's very frustrating. You can have code which runs faster on average, a lower median, but the standard deviation is too wide and that's actually not a good thing. You'd prefer a tighter deviation over a faster median. So where does C++ come into this? Well it's the language of choice within trading companies and that's no coincidence. It's low cost: you get basically zero cost abstractions. You can get your hands on the hardware reasonably, but it's still a relatively abstract language. That's great, that's what we want. But there's some catches. Your compiler, and compiler version, build and link flags, different libraries that you're using, the machine architecture — these are all going to impact your code. And they will unfortunately have interactions with each other as well. So you really need to not just know C++ well, but figure out what the compiler is actually generating. Anyone know an app for that? Very very useful. No coincidence that this comes from a trading company.
You'll learn more about that on Friday I believe. Tools like this are great. You can change the compiler flags, see what happens, have a look at what the actual resulting assembly is, change your versions, compare Clang to GCC, it's great. See which version of the compiler actually made a change to your code if you notice a regression. Now I should point out as well, I don't know how to write assembly code. But I do know how to read a little bit of assembly code and that's great. So from a performance point of view, I know that a call or a jump is going to be expensive. I know that more machine code is probably more expensive than less machine code. I know that if I have fewer jumps, more conditional moves, and things like that, we're onto a good thing here. Now just to talk a tiny bit about system tuning. This is a C++ conference, so I can't go into this in too much detail, but it's worth just pointing this out. So you can have exactly the same hardware, exactly the same operating system, the same binary, the same everything, but put them on a slightly different configuration from a BIOS point of view or a Linux kernel tuning point of view, and you get different results. So the blue server is a machine that is not running hyper threading. The orange server is a machine that is running hyper threading. This is sorting different size vectors of random integers. Why is the hyper threading ... Oh this is a single threaded (mumbles) running on one CPU. Why would the hyper threading slow us down here? What's common between the two threads on a core, if you're hyper threading? Cache. Exactly. So you're sharing the same cache. So your cache either goes down to 50% or more likely it goes down to zero because there's something else running on the second thread. Hyper threading is great, I'm not here to bag on hyper threading, I use it at home on my desktop, why not? It turns four cores into eight. But for a production server running low latency trading software, things like that will catch you out. You don't need to know about this from a C++ development point of view, but you better hope that there's someone in your company who can take care of that for you, otherwise you've already lost unfortunately. So how fast is fast? The Burj Khalifa in Dubai is 2,700 feet tall. Very tall building, tallest in the world at the moment. The speed of light, coincidentally, is around about one foot per nanosecond. So if we developed an autotrading system that ran end to end, wire to wire, from seeing the packet come in, to performing the logic, and sending a packet back out to the exchange in say two and a half microseconds, well that's actually less time than it takes for light to travel from the top of the spire of this building to the ground. That's not a lot of time. So when you go oh well it's alright, the machine might be swapping a little bit at the moment, or we'll go out to main memory, or it doesn't matter that someone else is using our cache at the moment — well it does matter. All bets are off at this point. You have lost the game at this point. There's such little time to actually execute your code. Okay so, that's the introductions done. I've got about 10 to 15 coding examples of just low latency techniques that have actually helped me in production make the code faster. This isn't just micro benchmarking where yeah, this appears faster — this is actual code that's been tested in production and it's made a serious difference. Most of these will be somewhat obvious to some people in the room, other people maybe not.
So don't be too worried if some things you already know. Hopefully there's a few surprises in there for you still. So this gave me about a 100 to 200 nanosecond speed up a couple months ago when I did it, which was removing the code on the left and replacing it with the code on the right. Why would this be faster? Mhm. So indeed, the compiler can optimize the hotpath better. What else? Yes, two answers at once and they were both super and they were both correct. One said cache pollution, one said branch prediction I think. Well cache and branching, yeah so, branch prediction: there's fewer branches on the right hand side for the hardware branch predictor to deal with. And cache, this is good — on the right, is it better for your instruction cache or your data cache? Both? I don't think he said both but (laughs) I claim it's both. Yeah, you've got fewer instructions, less pressure on the instruction cache. And also who knows what these error handling functions are actually doing? Maybe they're trampling your data cache as well. Whereas now we've only got one integer that we're actually reading and writing to, and machine architectures are incredibly good at testing if an integer is zero or nonzero. Does this seem obvious or ... ? After I did it, it was like oh why am I not doing this all the time? But yeah, just things like this can really make a difference. Hi. Yeah yeah yeah. So where is the error checking code? So the error checking code is behind the scenes. It's miles away. But if there's any flag that has been set by anything before hitting this hotpath, we'll set a flag to say do not go any further. Yeah. Then in this handle error, that can figure out what the actual error is, deal with it. Does that kind of answer your question? Yeah. Okay another quite simple example. I see this quite a bit, which is template-based configuration. Well, the lack of template-based configuration. Sometimes you don't know everything at compile time, so maybe you can read from configuration and based on that you're going to instantiate one type of class or another type of class, but you don't know what class you're actually going to instantiate. Most people will use virtual functions for this. Virtual functions are absolutely fine if you don't know what class you're going to instantiate, if you don't know all of the possible classes. If you know the complete set of things that may be instantiated, then you don't really need virtual functions. You can just use templates for this. And again this is a very simple example, but I think it's often overlooked and I see virtual calls going all the way through code and into the hotpath and it is slow and it is not required. You see this all the time in the STL and it works really really well. So here we've got two classes, and both send orders: order sender A and order sender B. And we've got an order manager which uses one of those two types. We don't know which one and it doesn't know which one, it's templated at this point in time. Now there is a virtual function call in there, the main loop, but that's just to start the program, so that virtual call is gonna disappear before you know about it. But the actual calls to this order sender are non-virtual, on a concrete type. How do you hook this up? Just use a factory function, yeah? Look at your configuration and then instantiate the correct type, pass it in, and you're done. No need for virtuals. Again, this is not a ...
I won't win any awards with this slide, but so many times I've noticed that people miss these sorts of things. You know everything at compile time, you know the complete set of things that can happen — use that information to make your code fast. No virtuals, faster code, more compact code, more chance for the optimizer to optimize, less pressure on the instruction cache, less pressure on the data cache. It's a good thing. And the next one, lambda functions. This will not give you a speed up. This slide will not give you a speed up, but it's a nice example of how C++ can be fast and powerful, but still relatively expressive as well. So on the left we've got a function called send message, takes in a lambda, prepares a message, invokes the lambda, sends the message. On the right is the actual usage. So here we're calling send message, we need to pass it a lambda, so here's the lambda; the lambda takes a message and then it just populates this. Now, this could very well be as follows: send message, on the left, the prepare message isn't allocating memory or anything like that, it's just getting the network card's buffer, via DMA, and returning that. The lambda will definitely get inlined, as two moves, which is about as low level as you can get. Send could be as simple as flipping one single bit on the network card to tell the network card to send. So that's very very low level, very powerful, very fast, and in fact, the entire function there on the left, send message, will probably get inlined as well, depending on what you're doing with it. So it's very very nice, very very fast, but still a relatively high level language. So you can see why C++ is somewhat the language of choice here. Quick show of hands, who knows that memory allocation is slow? Good. So that should be no surprise to you. Use a pool of objects. I don't care how it's done, you can use an allocator, you can do it by hand, I don't mind. But definitely try to reuse and recycle objects. Be careful with destructing objects as well, this can get expensive. Underneath all of this will be a call to free. Free's actually relatively expensive. If you go and have a look through glibc, it's about 400 lines of code. Yeah you're not going to hit every single line of code each time, but even the act of deleting objects can be relatively expensive, so do watch that. Okay, exceptions. Are exceptions slow? Not if you don't drop into them and throw, yeah. So absolutely, if you have an exception that's thrown, from my measurement it's one or two micros, but as far as I can tell, and I've measured this a lot, I've really tried to measure this in many many different ways — exceptions claim to be a zero cost thing as long as they don't throw. As far as I can tell, that is absolutely true. So don't be afraid of exceptions. Of course, if they throw, it's gonna be expensive, but if an exception is thrown you've got a problem anyway. You're not going to be sending an order anymore. So don't use exceptions for control flow, yeah? One, it's gonna be slow, and two, your code's gonna look really really bad. This is a little bit similar to a previous slide. You don't really need if statements. Particularly in your hotpath. Branching is bad, branching is expensive, and you can remove quite a bit of it. So here we've got a branching approach of running a strategy where we calculate a price, check some risk limits, and then send the order.
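Before that branching example continues, here's a rough sketch of the template-based configuration idea from a couple of slides back. The class, member, and config names are illustrative, not the code shown in the talk:

    #include <memory>

    struct Config { bool useVenueA = true; };   // hypothetical: read from a file at startup

    struct OrderSenderA { void SendOrder() { /* venue A wire format */ } };
    struct OrderSenderB { void SendOrder() { /* venue B wire format */ } };

    struct IOrderManager {
        virtual void MainLoop() = 0;            // the one virtual call, made once to start the program
        virtual ~IOrderManager() = default;
    };

    template <typename TOrderSender>
    struct OrderManager final : IOrderManager {
        void MainLoop() override {
            // Hot path: sender_ is a concrete type, so this call is
            // non-virtual and can be inlined by the compiler.
            sender_.SendOrder();
        }
        TOrderSender sender_;
    };

    // Factory: look at the configuration once, instantiate the correct type,
    // and virtual dispatch never appears again on the hot path.
    std::unique_ptr<IOrderManager> MakeOrderManager(const Config& config) {
        if (config.useVenueA)
            return std::make_unique<OrderManager<OrderSenderA>>();
        return std::make_unique<OrderManager<OrderSenderB>>();
    }

The single virtual call happens at startup inside the factory; everything the hot path touches afterwards is a concrete type.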
And if you look at the implementation of one of these methods, here's our branching: if it's a buy then we want to ask for a bit of a discount, otherwise it must be a sell, so we're asking for a little bit of a premium. That's pretty simple. This should run relatively fast. But we know at compile time it can only be either a buy or a sell, there's no third value. And we know what we want to do in both cases. We know that we want to either ask for a premium or ask for a discount. So you can just template specialize this completely away, to the point where you have just completely deterministic, nonbranching code. Don't go crazy with this. Don't remove every single if from your code, it will look really bad, and it'll probably end up slowing your code down. But in the hotpath you can go a long way with absolutely removing branches completely from your code. Hi. - [Audience Member] (distant voice) - The question was, can't the optimizer optimize this out? In theory, I guess it probably can. In practice, it normally doesn't. And actually there is a very very good point. So much of this is just guaranteeing that the compiler is going to do what you expect it to do. You're kind of locking the compiler in to say hey, I really want you to know that there's only two particular branches here or two particular conditions. So a lot of it is just basically taking the risk out of it and being very very concrete with telling the compiler what you want to do. Multi-threading. I don't like multi-threading. I know that you need it sometimes. I know that you don't want to be doing everything in the hotpath. I know that you need to sometimes maybe have a second thread on a different CPU, or a different process on a second CPU, second core, doing some heavy lifting, doing some calculations, responding to messages and things like that. Okay I get that, you need multi-threading, but keep the data that you share between the hotpath and everything else to the absolute minimum. If you can do that, you're gonna get away with faster code than otherwise. So consider not even sharing data. Consider just throwing copies of data across from the producer to the consumer, maybe via a single writer, single reader queue. Yeah you're still sharing a little bit there, you're sharing the atomic read and write pointers for the queue, but that's actually not much data to share. I think that's a reasonable approach. Of course sometimes you don't even need to worry about multithreading, sometimes ... Sorry, protection around multithreading. Sometimes receiving out of order is actually okay, it's not the end of the world. Sometimes the machine architecture is so strong that it'll actually avoid the race conditions that you've read about anyway. So it's worth experimenting with that. So we're about halfway through now I think of the tips and tricks section. The next one is data lookups. So if you read a software engineering textbook, it'll probably suggest that if you have an instrument that you want to trade on, and a market, then each instrument has one market, and a market can have many instruments. So you can see the ER diagram forming. You'll do a lookup, so you go okay, this code here would make sense: you create your message that you want to send to the exchange and you'll go look up the market and you'll get the information out of the market and that will be part of your computation. So that's fine. That's not a problem. But you can do things a lot faster than that.
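Before the data lookup idea pays off below, a minimal sketch of the buy/sell template specialization described above — the price/credit model and names are made up for illustration:

    enum class Side { Buy, Sell };

    // Branching version: one runtime 'if' per call on the hot path.
    inline double CalcPriceBranching(Side side, double price, double credit) {
        return side == Side::Buy ? price - credit : price + credit;
    }

    // Branch-free version: the side is a template parameter, so each
    // instantiation is straight-line code with the decision made at compile time.
    template <Side S> double CalcPrice(double price, double credit);

    template <> inline double CalcPrice<Side::Buy>(double price, double credit) {
        return price - credit;   // buying: ask for a bit of a discount
    }
    template <> inline double CalcPrice<Side::Sell>(double price, double credit) {
        return price + credit;   // selling: ask for a bit of a premium
    }

    // The whole strategy can then be stamped out once per side.
    template <Side S>
    void RunStrategy(double price, double credit) {
        const double quotePrice = CalcPrice<S>(price, credit);
        // CheckRiskLimits(quotePrice); SendOrder(quotePrice); ...
        (void)quotePrice;
    }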
If you're going to read the instrument, if you're going to read the current price of the instrument, how many bytes have you read there? How big's the float? Four bytes? Okay, so how many bytes have we read? 64. Why have we read 64? Because we've read a cacheline. We can't do anything about that, that's how the machine works. So, why not put the information that you need into that cacheline? So if you know that you always need to read, say for example, the quantity multiplier from the market, and that hardly ever changes — it may not change at all that day — put it into the instrument. You're going to be reading that cacheline anyway, and then you get that for free. No further lookups, you're done. It's denormalized, might violate a few software engineering principles, but that's okay. You're gonna get faster code. Containers. Containers are worth a handful of slides, so I've got about four slides on unordered map. So you need something to do lookups on. Sometimes you are gonna have to look up data. So that's fine. So this is your likely implementation of an unordered map. You have multiple buckets, you have a key, the key gets hashed, the hash value corresponds to one and only one bucket, and your key and value pair will go into the bucket. And if you get multiple pairs of keys and values that map to the same bucket, then you'll get a chain effect. So this will be a linked list underneath the hood. Default max load factor of one. So on average, there will be one item per bucket. Order one insert, order one find. Unordered map is kind of your go to map if you want fast containers within C++ and you want to just use the STL. And by the way, the original proposal is in the bottom right, really worth the read. It's a really good description of how unordered maps and unordered sets are implemented. Linked lists are not guaranteed to be in contiguous memory. So you really don't want to be jumping around this linked list looking for your key. You want to find it the first time. So, I did an experiment recently where I took 10,000 ... Well I generated 10,000 keys and I generated them from the range of zero through to one trillion. Then I inserted all of them into an unordered map. Then I asked each node: how many other nodes are in your bucket? And this is the distribution that I got back. Most nodes are in a bucket all by themselves, but only just. You will get collisions. This was a uniform distribution, you are going to get collisions. Unless you have a perfect hashing function. Which in this case I deliberately didn't have, but I had a pretty good hashing function. I used std::uniform_int_distribution to generate the keys. So if we go back, you will from time to time have to run through your nodes to find the key. It's not always going to be the first one. So let's time this. This is Google Benchmark for the top half of the slide. Anyone use Google Benchmark? Yeah it's good. Good to see. If you haven't already, check it out, it's fantastic for parameterizing micro benchmarks. 14 nanoseconds through to 24 nanoseconds, so you can see that's reasonable. And you can see that the speed of a find is broadly related to the number of elements in the collection. There's the Linux perf output in the second half of the slide there. Nothing too surprising in there. Instructions per cycle was a little bit low, but there's also quite a bit of stalling going on there as well. And if you do the math, that means that broadly one out of 500 cache references is actually a cache miss. So-so.
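For what it's worth, a parameterized micro benchmark along the lines of the experiment above might look roughly like this with Google Benchmark — the key range and collection sizes are only indicative:

    #include <benchmark/benchmark.h>
    #include <cstdint>
    #include <random>
    #include <unordered_map>
    #include <vector>

    static void BM_UnorderedMapFind(benchmark::State& state) {
        std::mt19937_64 rng(42);
        std::uniform_int_distribution<std::uint64_t> dist(0, 1'000'000'000'000ULL);

        std::unordered_map<std::uint64_t, std::uint64_t> table;
        std::vector<std::uint64_t> keys;
        for (std::int64_t i = 0; i < state.range(0); ++i) {
            const std::uint64_t key = dist(rng);
            table[key] = key;
            keys.push_back(key);
        }

        std::size_t i = 0;
        for (auto _ : state) {
            auto it = table.find(keys[i++ % keys.size()]);
            benchmark::DoNotOptimize(it);   // keep the lookup from being optimized away
        }
    }
    // Time find() against collections of different sizes.
    BENCHMARK(BM_UnorderedMapFind)->Range(1 << 10, 1 << 20);
    BENCHMARK_MAIN();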
Or you could use something like Google's dense hash map. Fast. A little bit of complexity around management of collisions, but that's okay. So, given the choice I'd go for this. Or you can ignore Bjarne's suggestion this morning of not reimplementing anything in the STL in relation to hash tables, and take a hybrid, take the best of both worlds. That's what we do at Optiver for a lot of our (mumbles) code: we have a slightly different hash table. Lots of different implementations of hash tables, plenty on the internet, pick your best one. What I'm going to do is just quickly describe what we use at Optiver, 'cuz it's quite a neat approach. Maybe you do something similar at your company, I don't know. But we have something like this. So the keys and values are just anywhere, they could be in a separate collection, or it could just be heap allocated. It really doesn't matter too much. The table itself is actually a precomputed hash and a pointer to the object. That's your hash table. So it's like a hash table of metadata. Importantly, the hash: eight bytes. The pointer: eight bytes. So each pair is 16 bytes. You can fit four of those in a cacheline. Which means that when you read one, you've read another three more than likely as well. Which means that you've already prefetched any subsequent data that you might need to resolve the conflict. So we take an example: if we look for key 73, if this is an integer key, then well the hash is just 73 as well, maybe this indexes to index one. Go to index one, the hash is 73. Okay great, we probably found what we're looking for, follow the pointer, yeah the key is 73 as well. Okay, great, we're done. If we take another example, key 12, maps to hash 12, maybe this thing happens to index slot number three, based on the implementation. Let's check the hash — the hash is 98, ah okay, so there's a collision. That's alright, we'll just follow it to the right. What's the next hash value? Okay, that's hash value 12, great that's what we're looking for, follow that pointer, yeah, there's our key, there's our value. Very cache efficient. Don't take my word for it. So if we micro benchmark this, twice as fast, seems to be less susceptible to the size of the collection. Instructions per cycle are looking a lot healthier now, double if not triple. And only one out of 1,500 cache references is actually a cache miss. So just a more cache friendly implementation of a hash table. There may be a talk later today about hash tables like this possibly by ... - [Audience Member] (indistinct) - Sorry? - [Audience Member] Wednesday. - It's Wednesday. - [Audience Member] Yeah. - Cheers. There may be a talk on Wednesday about this. Okay a couple more to go. always_inline and noinline. The inline keyword is incredibly confusing. Inline does not mean please inline; inline means there may be multiple definitions, that's okay, linker, please overlook it. If you want something to be inlined, you're far better to use GCC and Clang's attributes to say force this to be inlined please. Or force this to not be inlined. Now, dragons lie here. You need to be careful about this. Inlining can make your code faster, inlining can make your code slower. Not inlining can also give you either of those two alternatives. Again, you need to measure. Look at the disassembly, but even better, actually measure in production.
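Before the inlining discussion continues, here's a rough sketch of the hash-plus-pointer metadata table described above. Open addressing with linear probing is one plausible reading of "follow it to the right"; the real implementation will certainly differ:

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct KeyValue { std::uint64_t key; double value; };   // lives anywhere: separate collection or heap

    // One slot is 16 bytes, so four slots share a 64-byte cacheline:
    // reading one precomputed hash effectively prefetches its three neighbours.
    struct Slot {
        std::uint64_t hash = 0;
        KeyValue*     node = nullptr;
    };
    static_assert(sizeof(Slot) == 16, "four slots per cacheline");

    class MetadataHashTable {
    public:
        explicit MetadataHashTable(std::size_t capacity) : slots_(capacity) {}

        KeyValue* Find(std::uint64_t key, std::uint64_t hash) {
            std::size_t i = hash % slots_.size();
            while (slots_[i].node != nullptr) {          // probe to the right until an empty slot
                if (slots_[i].hash == hash) {            // compare hashes before touching the node
                    KeyValue* kv = slots_[i].node;
                    if (kv->key == key) return kv;       // confirm against the real key
                }
                i = (i + 1) % slots_.size();
            }
            return nullptr;
        }

    private:
        std::vector<Slot> slots_;   // assumed never completely full
    };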
So just a trivial example here, but we go to check the market, and if we decide that we're not going to send the order 'cuz something's gone wrong, do some logging, otherwise send the order. Now if this is inlined, if the compiler goes hey that's not a bad function to inline, it doesn't take too many parameters, yeah sure let's inline it, why not? You're gonna pollute your instruction cache. You've got all of this logging code, which you don't really want in the hotpath, right there in the middle of the hotpath. So this is an easy one, just noinline this, that'll drop out, that'll turn into a call, chances are the branch predictor will ... Well, two things: chances are the compiler will pick the right branch, and even better, at runtime the branch predictor will probably be picking the right branch all the time for you. Gonna talk about the branch predictor more very soon by the way. The hardware branch predictor. I'm gonna talk about the hardware branch predictor right now as it turns out. Okay, this is the most important slide out of the 15 or so tips and tricks. This gives me a five microsecond speed up on my code. Or in other words, whenever I screw this up, the system runs five microseconds slower. So much in fact, I have a flow chart on my desk which basically goes: oh, you're running five micros slower. Did you screw this up? Yes. Dump your orders. Fix it, and go back to the start. So this is what happens in practice. You hardly ever execute the fastpath, the hotpath, and I said this right at the start. Gonna say it again, the fastpath is very seldom executed. Maybe you get close and then you decide actually I can't because of risk limits; the next market data that comes in isn't interesting; the next is, but again the strategy decides I've already got enough of that instrument, I don't want to trade it. Then eventually, at some point, ah great there's an order we want to go for, let's shoot for it. Now on top of this there's other things going on which I haven't put on the board, on the graph — things like the unordered map that's trampling all over cache, and other things that are trampling all over our cache, handling other messages that are coming in, doing background processing. So how do we fix this? How do we keep the cache hot? Well, we pretend we live in a different universe where everything that we do results in an order being sent to the exchange. Here's a tip: you really don't want to send everything to the exchange. They'd get very annoyed with you very quickly. But you can pretend. So as long as you've got confidence that you can stop this before it gets to the exchange, within your own software, within your own control, then pick a number somewhere between 1,000 to 10,000. That's gonna be the number of times that you simulate sending an order through your system. If you're using a low latency network card such as Mellanox or Solarflare, chances are even the card will allow you to do this. This is industry practice, it's understood that people want to push data onto the card but not send it. It's just warming the card. So network cards will support this, so that's great. So basically saturate your system with I'm going to send an order, I'm going to send an order, I'm going to send an order, bang, I did send an order. Keeps your instruction cache hot, yeah? Nothing's gonna get evicted if you're doing that. Probably keeps your data cache hot if you're picking the right data, and it trains your hardware branch predictor.
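As a quick aside before more on the branch predictor, this is roughly how the GCC/Clang attributes mentioned a couple of slides back can be applied to a hypothetical check-the-market-and-maybe-log example (all names invented):

    // GCC/Clang spellings; other compilers have their own equivalents.
    #define ALWAYS_INLINE inline __attribute__((always_inline))
    #define NO_INLINE     __attribute__((noinline))

    // Cold path: forbid inlining so the logging code (and everything it drags
    // in) stays out of the hot path's instruction cache and becomes one call.
    NO_INLINE void ComplainAboutMarket(const char* reason) {
        // ... string formatting, logging framework, etc. ...
        (void)reason;
    }

    // Hot path helper: tiny, and we want it inlined at the call site.
    ALWAYS_INLINE void SendOrderToExchange(double price) {
        // ... write the order into the network card's buffer ...
        (void)price;
    }

    void CheckMarketAndSendOrder(bool marketLooksGood, double price) {
        if (marketLooksGood) {
            SendOrderToExchange(price);              // stays compact and inlined
        } else {
            ComplainAboutMarket("market not good");  // out of line, off the hot path
        }
    }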
Your branch predictor, if it's just done the same thing 10,000 times before in the last second, chances are it's gonna pick the right path this time. Hey. - [Audience Member] Do you use profile-guided optimization? - Ah yeah, good question. Do I use profile-guided optimization? Yeah, sometimes, in many different ways. Yeah, so profile-guided optimization is kind of a substitute for this, but again that's only something you can do at compile time and it's not gonna help with the hardware at runtime. But indeed, yeah, profile-guided optimization is great, except you can overfit the model. So profile-guided optimization is effectively: you run a simulation, or maybe even run in production, where you're writing to a profiling file of what's actually happening with your system, then you recompile with that information and then the compiler can use that for branch prediction hints and reordering. So that works relatively well, but you need to make sure that the profiling that you're doing matches production. And sometimes that means that you're very fast sometimes and way off the mark other times. So it's kind of this thing that looks really great, and it is some of the time. Good question. Okay, this is a Xeon E5 CPU. How many cores does this have? I claim eight. And the cores are on the left and the right. And what's in the middle? Cache. Glorious, glorious cache. L3 cache. Probably about 50 megs worth of cache. Very fast, this is what you want. How many cores get access to this cache? All of them. Well that sounds like a problem. Because I want this cache all to myself. Yeah? I turn off all but one of your cores. Now you get the cache all to yourself. It's a neat trick. It's not very respectful to any Intel engineers in the room today, and as a teenager I couldn't have believed I'd be having eight cores and working for a company where I could turn all but one of them off. Or 22 cores and turning all but one of them off. It's a really effective way of increasing the average L3 cache per core. (audience laughs) Hey. Sorry? Have I tried using a single core CPU? Oh (laughs), interesting. So have I tried using a single core CPU or convincing someone to make one? So no I haven't, for the answer to both of your questions. We can talk about that afterwards a little bit, but I suspect that Intel are not necessarily going to be particularly interested in what I would want, because my domain is such a small part of Intel's market. That's just my gut feel. Oh incidentally, if I can go back one. Yeah, if you do run multiple cores, normally you won't disable your cores, but just be careful about what other processes are running on other cores that share that cache. If they're noisy it's gonna cost you. If you've got noisy neighbors, get rid of them, put them on a completely different CPU somewhere else. It will probably help you out. Great quote. (audience laughs) Okay this is nitpicking, but this caught me out and it caught people from other trading companies out as well. Placement new can be slightly inefficient. Slightly. So if you use any of these compilers, placement new will be doing a null pointer check. And if the pointer that you pass into placement new is null, it won't construct the object, it won't destruct the object, and it will return null back to the caller. Why? Because that's what regular new does. And the spec was a little bit ambiguous as to what placement new was meant to do. So most compiler implementers just did the same thing as regular new. But it's actually really inefficient.
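Before the placement new story continues below, here's a very rough sketch of the warm-path idea from the previous slides — pretending that every event results in an order, with a hook that stops the warm-up orders from actually going out. Every type and helper here is an illustrative stand-in:

    struct MarketData { double midPrice; };

    static double CalcPrice(const MarketData& md) { return md.midPrice - 0.01; }
    static bool   CheckRiskLimits(double) { return true; }
    static void   EncodeOrder(double) {}        // write the order into the NIC buffer
    static void   WarmNetworkCard() {}          // push bytes to the card without triggering a send
    static void   SendToExchange() {}           // flip the bit that actually sends

    // The same code and data are exercised whether or not we really trade,
    // keeping the instruction cache, data cache and branch predictor warm.
    template <bool DryRun>
    void RunHotPath(const MarketData& update) {
        const double price = CalcPrice(update);
        if (!CheckRiskLimits(price)) return;
        EncodeOrder(price);
        if constexpr (DryRun) {
            WarmNetworkCard();     // thousands of these per second
        } else {
            SendToExchange();      // the rare, real execution
        }
    }

    void OnMarketData(const MarketData& update, bool reallyTrade) {
        if (reallyTrade) RunHotPath<false>(update);
        else             RunHotPath<true>(update);
    }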
Why would C++ do an additional check for you? That's not the idea of C++ — you don't pay for what you don't get, you don't pay for what you don't use. If you're going to pass a pointer into a function, you should make sure yourself whether it's null or not null. Yeah, so. Marc Glisse and Jonathan Wakely got the spec updated, the standard updated, so it's now undefined behavior to pass null to placement new. That means that the compiler can stop doing this null pointer check. Does a single null pointer check make a difference? Actually it does, yeah. So if you were right on the border of this function being inlined and not being inlined, this additional instruction could push it over the top. Also with GCC — GCC was particularly sensitive to this, and actually gave up on a whole bunch of optimizations. So yeah, just a small thing, but it actually made quite a big difference, enough that several trading companies picked up on this. If you can't use a modern version of the compiler that has the fix, there's a quick workaround, at least in GCC. Okay another thing that caught me out: with GCC 5 and below, or rather below GCC 5.1, std::string wasn't a great implementation. It had copy-on-write semantics and no small string optimization. So great, we moved to later versions of GCC, started using strings all over the place, but what do you know, it was slow. Because we use a standard distribution of Red Hat and CentOS, and the way that GCC is packaged is it maintains ABI backwards compatibility. So it doesn't do the std::string breaking change, which means that we're still stuck with the old std::string implementation. And that was first noticed as being a bit questionable by Herb Sutter in 1999, and today in 2017 I'm still stuck with copy-on-write strings. Another small nitpick, but C++11 has these static local variable initialization guarantees: a static will be initialized once and only once, even if multithreaded. Even if an exception comes in and blows away that routine, it'll go okay, we're not initialized yet. So very useful, but of course, that doesn't come for free. So every time that you refer to that static variable inside that function there's going to be a slight cost, which is checking the guard variable to see if that has been set or not. So again it's a minor nitpick, but if you are absolutely micro optimizing code, this will have a slight impact as well. Who knows that std::function might allocate? Most people? Yep, cool. So that can be a little bit surprising, because sometimes it doesn't allocate. It depends on the implementation. There's a small buffer optimization in there, around about 16 bytes I think. So this is with GCC 7.2, this is a silly trivial example, but here we are, we're not actually doing anything — we're capturing 24 bytes worth of data but we're not doing anything with it. Clang will just optimize this right out, GCC doesn't. So you'll actually see that there's an allocation for this code, and actually a deallocation, which I didn't print on the screen. That gets a little bit annoying, because std::function is actually very very useful. So you'll see that several people have implemented nonallocating versions. I believe there's talks on this, again I would possibly claim later on today, but maybe I'm wrong again. But yeah if you want to — this is a shameless plug by the way — go check out SG14's inplace_function, which doesn't allocate.
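Before the inplace_function description continues, a small illustration of the allocation issue: whether std::function allocates depends on the capture size and the implementation's small buffer, so keeping the lambda's own type (via a template parameter) sidesteps the question entirely. The byte counts and names below are illustrative:

    #include <functional>

    // Type-erased: the lambda is wrapped in std::function, which may heap-allocate
    // once the capture outgrows the implementation's small buffer.
    void ForEachOrderErased(const std::function<void(int)>& callback) {
        for (int id = 0; id < 3; ++id) callback(id);
    }

    // Generic: the lambda keeps its own type, nothing is erased, nothing allocates,
    // and the call can be inlined.
    template <typename Callback>
    void ForEachOrder(Callback&& callback) {
        for (int id = 0; id < 3; ++id) callback(id);
    }

    void Example() {
        long a = 1, b = 2, c = 3;                    // 24 bytes of captured state
        ForEachOrderErased([a, b, c](int id) { (void)(id + a + b + c); });  // may allocate
        ForEachOrder([a, b, c](int id) { (void)(id + a + b + c); });        // does not
    }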
It will allocate in place, which means that if you declare this function on the stack, the buffer for the closure will be on the stack, with compile time static asserts. It also runtime checks if need be as well. Okay, the very last point for this section — then we've only got a small section to go — is glibc can be particularly evil if you're looking to do low latency stuff. So std::pow: if you call it with 1.0 to the power of 1.4, you'll get the right answer. If you call it with 1.0 to the power of 1.5, you're also gonna get the right answer, but one's gonna be way slower than the other. It's not quite 1.0 by the way, it's 1.0 with a little bit of rounding error in there. But if you look at these timings, if we go with 1.4, 53 nanos on both an old version of glibc and a new version of glibc, great. If you do it with 1.5, the pathological case, you're in trouble. Half a millisecond to calculate one single computation of power. The reason is the function's transcendental: it'll try to be accurate and fast, and if it can't, it falls back to being accurate but slow. So this has caught us out, it's effectively crashed our autotraders, because if you try to calculate this a thousand times quickly, everything kind of stops. Yeah, relatively surprising. You can upgrade to glibc 2.21, which is a couple of years old now I think, that'll help you out, but most standard distros of Linux will be packaged with 2.17. So again, something to watch out for. While I'm on this topic, another thing which isn't on my slides but I should mention, is try to avoid system calls. System calls will just kill you. So you don't want to be going to the kernel, at all, you want just your C++ code to be run. You wanna get all interrupts absolutely away from the core that you're running on. You want your hotpath to be running on a single core. Nothing else needs to know about that single core. Any sort of system calls, like a call to select or a call into kernel-space TCP, or anything that might invoke a system call whatsoever, you want to get rid of that. Sorry, just popped into my head, I thought I should say it. Not particularly funny, but very true. So the final section, only a couple of slides actually. Hey! I got a reminder. Talk, yes, that is right. (audience chuckles) That could've gone far worse. Okay, there's two ways to kind of measure things really, or two common approaches. One is to profile, see what your code is doing. The other one's to actually benchmark and time it. These two things are subtly different. So profiling is good, it can show you hey look, you're spending most of your time in malloc. Why are you even in malloc? This makes no sense. Whereas benchmarking is actually start, stop, right, this is how fast we are. If you make an improvement to your code based on profiling, that's great, but maybe your code's not faster, maybe it's slower. You know, you're just guessing. You need to measure. Once you've finished measuring, measure again, and then measure one more time, a third time. Okay so gprof, a sampling profiler. Gprof is great for checking out what your code is doing, but it's not going to work if most of the time your code is doing nothing and then for 300 or 400 nanoseconds, all of a sudden it's doing something and then it goes back to being idle. Gprof is going to tell you that you're doing absolutely nothing. It's gonna miss the actual events that you care about. Gprof is out. Valgrind, Callgrind, fantastic. Can tell you actually what your code is doing, which functions it's going into, how much memory you're using.
But it's basically a virtual machine, it's simulating the CPU, and it doesn't even take into account I/O characteristics, that's gone as well. It's too intrusive, it's not going to give you accurate performance results. I've been talking about microbenchmarking a lot. It's kind of okay, but don't rely on it. Because if you're sitting there spinning in an absolute tight loop, doing map lookups, that's a very different characteristic from what your code is gonna be doing in production when it's doing a one off map lookup once every eight trillion cycles. So benchmarks are kind of nice for map shootouts a little bit and things like that, but you can only take them so far. Now all of these tools are really useful, don't get me wrong, I use all of them, they're great, but not necessarily for that very last step of microoptimizing your code. So we need something else. So this is what I use, and it is by far the most accurate benchmarking system I can come up with. What you do is you actually set up a production system: the server on the left here is replaying market data in real time. Across a network, my code is running on the server on the right, picking up the market data and then from time to time sending a trade or a request to trade back across the wire to what it thinks is the exchange. Of course it's just a fake exchange. What I've got in the middle is a switch, and the switch is tapped into the network and it's unobtrusively capturing every packet that's going across the wire and recording it, and also putting a timestamp into either the header or the footer of the Ethernet packet. So it's actually recording the true time that that packet went across the wire. Probably within an accuracy of five nanoseconds or something like that. Then afterwards, once you've done an hour or a day's worth of simulated trading, then you can analyze the packets, look at which packets triggered an event, compare the timestamp of the corresponding packet that got sent back out to the exchange, do a difference of the timestamps, and that will tell you the actual true time that it's taking for your system to run. And if you see a speedup in this lab setup, you'll see a speedup in production. If you see a slowdown in the lab, you'll see a slowdown in production. Incredibly hard to set up, you have to get this perfect. All the time you'll forget ... You forgot to turn off particular interrupts, or you forgot to set CPU affinity, or someone decided it might be a good idea to build on that server because they found that it was free and very few people were on it, so they ran GCC on 24 cores. But this is a real pain, it's not easy to set up, but it's by far the most accurate way to do true measurement. Okay, so the brief summary is: you don't need to be an expert about C++, but you do need to know it relatively well if you want fast low latency software. You have to understand your compiler. You have to understand what your compiler's trying to do and help your compiler along from time to time. Knowing the machine architecture will help. Knowing the impact of cache, the impact of other processes — it's gonna help you make good design decisions. Ultimately, aim for really really simple logic. Not just simple logic, but get rid of your logic if you can: go for static asserts, constexpr, templates. Just try to make your compiler's life as easy as possible in that respect, and reduce your code as much as you can. The best thing about this is compilers optimize simple code the best by far. Simple code runs faster than complex code.
It's just a fact, just as ... Sometimes you don't need perfect precision, sometimes an approximation is absolutely fine and if the approximation is fast, that's great. There's no point calculating a theoretical price to 12 significant digits if really only two digits would do. And if you need to do expensive work, don't do it on the hotpath, just do it when you think it's a quiet time. And ultimately it's about measurement. If you can't measure, you're basically just guessing. It's incredible, every time that I think this code is gonna be faster, do do do do, measure it, it's always slower. I always get it wrong, it's difficult. Okay, so that's me. This was quite lightweight, not entirely scientific, it's more just a hey this is the sort of things that we do. I hope you found that somewhat useful. Did people learn something? Little bit? So-so? Okay, cool. I'm pleased. Thank you. Cheers. (audience applauds) We've got a couple minutes for questions if anyone feels like it. Hey. - [Audience Member] Hi, can you comment in the use cases when you were only using one core on the multicore chip, do you know if Intel's Turbo Boost was helping you out at all or not? - Yeah, that's a good question. So I could spend another hour or two talking about the actual system tuning type of it. Won't go into too much detail here, but effectively you need to be very careful of Turbo Boost and different states, you just want things to be constant. You don't want things to be jumping up and down and yeah ... Cool. Hey. Oh, sorry, we'll go there. Hi. - [Audience Member] Hi, so one quick thing, the actual question is on newer architectures, you can actually force the L3 cache to be partitioned? - Yeah. - [Audience Member] Okay. You already know, but I'm talking about other people. - I'll just recap in case no one heard that, or I just cut you off. Yeah, you don't just have to turn off CPUs, you can actually lock the cache with latest versions of Intel CPUs. - [Audience Member] Right. When you describe the hash table, the layout in memory, you had an array, and you had the full hash, which you keep and then a pointer, but usually in HFT you only really care about access, not insertion. Insertion's often outside of the critical path. - Correct, yeah. - [Audience Member] Not always but often. So why not instead have ... If you have the key values in an array as well, you can just use the index of the hash itself to access that. Now you're only storing the hashes in the first lookup array so you get eight instead of four. So why not? - If I understand the question correctly, won't that put more pressure on your cache? Because you'd actually have less hash values per cache line. - [Audience Member] You'll have more, because you're not storing the pointer at all. The first layer of the lookup is just into an array with nothing but the hashes. - Oh I see what you're saying. - [Audience Member] And then if you verified that the hash is correct, you just use the index and go into the key value array. - Yeah you could do that, but that's going to also, as long as you're very confident that you're not getting too many collisions. So I think you're going to get a tradeoff. I see what you're saying, and yeah you could do that, but that's assuming a pretty damn good hash function I think. But there's always ... There's no right and wrong answer I think for hash tables there's always a bit of a sliding scale. - [Audience Member] Alright. - Thank you, cheers. Hey. 
- [Audience Member] So I'm interested in where you talked about using only one CPU and locking out the others, or one core. Do you mean that you're going to use one thread as well? - Yeah. - [Audience Member] And how does that work when you have multiple data sources that you need to look at and everything? Like ... - Yeah. - [Audience Member] Can you explain a bit about that? - Yeah sure, so this is getting away from the C++ side, which I purposely didn't want to talk about, being a C++ conference, but very very briefly: yeah exactly, one thread on that one core. That's it, that's the fast thread. But of course you can from time to time poll data coming in, once every few hundred milliseconds, then poll the data, and statistically you should be okay. It all very much depends on how much data's coming in; if you've got a lot of data coming in, then no, you'd need a second thread. But yeah, effectively ... - [Audience Member] I mean though the data that you need to respond to quickly, the data from the market. So you can't actually poll them, because then you increase your latency. - Ah yeah, true, but you could also have something ahead of you doing a little bit of a low pass filter or discarding more of the data. Yeah, you definitely have a component ahead filtering a lot of that out for you, otherwise yeah, you'd be saturated. - [Audience Member] I see. - Yeah. - [Audience Member] Thank you. - Yeah, you're welcome. Hi. - [Audience Member] Just a small point. You said consider deleting large objects in another thread. One of the things I've been burned by is that we'll drain the TCMalloc size class for that and then cause global mutex locks across all your threads when you allocate or deallocate in that size class. - Yeah, I was wondering if someone would pick up on that (laughs). True. We actually often will allocate on that separate thread as well and just pass the pointer across. Yeah, good point. I was praying no one would pick up on that. Hi. - [Audience Member] Hi, do you use things like __builtin_expect to mark branches likely or unlikely? - Yeah, so this will have to be the last question, correct? Yeah, and I deleted those from my slides this morning because I ran out of time, as we have just done as well. I mean __builtin_expect, definitely — for those who don't know about it, it's just a hint that will tell the compiler hey, this is the branch that I want you to take. Absolutely, you can use that, but it's not a silver bullet. Everything is about getting that hardware branch predictor trained up. In saying that, if you have a function which is called very very irregularly and you want that function to be fast when it's called, the hardware branch predictor has zero information. In that case, yeah, your likely true and likely false branch prediction hints are exactly what you need to make sure that the correct path is taken. Good question, thank you. So we've gotta ... We can catch up afterwards, if maybe we have to. Time's up now. We can talk afterwards. Yeah. Great, thank you. (audience applauds)
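As a footnote to that last question, a minimal sketch of likely/unlikely hints built on __builtin_expect (C++20 later added [[likely]]/[[unlikely]] attributes for the same purpose); the two functions are placeholders:

    // GCC/Clang only: tell the compiler which way a branch usually goes so it
    // can lay out the common path as the fall-through.
    #define LIKELY(x)   __builtin_expect(!!(x), 1)
    #define UNLIKELY(x) __builtin_expect(!!(x), 0)

    static void HandleError() {}   // cold: rarely executed
    static void RunStrategy() {}   // hot: the path we want laid out straight

    void OnMarketData(bool somethingWentWrong) {
        if (UNLIKELY(somethingWentWrong)) {
            HandleError();
            return;
        }
        RunStrategy();
    }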
Info
Channel: CppCon
Views: 145,681
Rating: 4.9173818 out of 5
Keywords: Carl Cook, CppCon 2017, Computer Science (Field), C++ (Programming Language), Bash Films, conference video recording services, conference recording services, nationwide conference recording services, conference videography services, conference video recording, conference filming services, conference services, conference recording, conference live streaming, event videographers, capture presentation slides, record presentation slides, event video recording, video services
Id: NH1Tta7purM
Length: 60min 7sec (3607 seconds)
Published: Sun Oct 08 2017