A Low-Latency Library in FPGA Hardware for High-Frequency Trading

Video Statistics and Information

Captions
So the next speaker doesn't need a lot of introduction to this audience: John Lockwood. He has been associated with this conference for a long time. He was a faculty member at Washington University, where he led the reprogrammable network architecture group (I'm not sure that was the exact name), he led the NetFPGA team at Stanford, and he's now with Algo-Logic. He will talk about network acceleration as well, but in the context of a specific domain: high-speed trading.

Thanks. So in this talk I want to give an introduction to high-frequency trading, give a survey of some of the software, hardware, and hybrid approaches being used on Wall Street today, talk about our approach, which uses field-programmable gate arrays to do HFT, and some of the advantages and disadvantages of using FPGAs, and then talk about the low-latency library that we've implemented and have running and deployed on the NetFPGA-10G platform. That was the open platform we built at Stanford as a second version of NetFPGA. We'll show how it was used to do some exposure and position tracking applications, talk about which protocols are supported, and give results from the deployment.

High-frequency trading is trading of equities, options, and futures at high speeds and in large volumes. The idea is to earn money by exploiting fleeting variations in stock prices: it only takes a small margin at a high volume to make money. HFT accounted for over 70% of all trades in the US markets in 2010, and it's more now, so it's a large part of what happens in the market. It involves using computers to place orders based on predefined algorithms, taking the human out of the trading process, or rather having the human guide the high-level operation while machines execute.

The main challenges for the financial markets are latency, because being able to execute orders faster than other investors is what makes it possible to earn money, and jitter, because you need to provide a consistent and fair execution time for the trades. Secondary considerations are throughput, since you have to handle a large volume of orders, and flexibility, since you have to be able to adapt to the different strategies and risk patterns that can come up.

Some recent problems with high-frequency trading: just recently there was what we might call the nightmare at Knight Capital. Knight Capital had test scripts running in their development lab to test their market maker; the scripts would buy and sell against their other software, which was doing the market making. They did a software update one morning a few weeks ago, went over to the deployed network, and the script was still running: somebody forgot to go into the cron job and turn off the script. They lost roughly $440 million in 45 minutes. That put the company nearly out of business; they survived by scavenging money from some of their customers, who became investors, and they basically transferred the entire ownership of the company over the weekend so that they could open up on Monday morning. Otherwise it would have been a nightmare for the SEC to deal with and clean up. So that was one problem.
The Nasdaq Facebook IPO was another disaster. In this case there were $62 million in direct damages on the day Facebook launched its IPO, because Nasdaq couldn't price the IPO that morning (they couldn't compute what price it should start trading at) and they didn't give the order execution reports back to the people who had placed the orders. As a result, no one knew what they owned when Facebook went public. That was a big problem, because Facebook's stock started off at about $42 on the first day of the IPO and has fallen to less than $19 today, so how quickly you could get rid of that stock, which is what everyone was trying to do, was what made or lost money. Nasdaq is in final negotiations with the SEC over this and is likely to pay out the sixty-two million dollars to make up for the software glitch.

Another problem was the BATS failure, a software bug in their order auction software. BATS was doing its own IPO that day, and on the day of launching their IPO they screwed up trading not only in their own stock but in the other symbols starting with A and B; they also screwed up Apple, and as you can imagine Apple shareholders were more than a little upset when the stock started having wide price fluctuations. As a result BATS was forced to cancel its IPO that same day and refund all the money the next day, so that company also lost out on a lot.

I think the biggest problem with all of this is that HFT has not only hurt the banks and institutions (the direct losses are about half a billion dollars from these incidents alone), it also really hurts the credibility of the market. When people lose trust in the market they stop using it, and a lot of people don't like HFT because they feel it's not trustworthy, that there are glitches, that a software problem can take down the whole market. I think that's the biggest problem, and it has to be solved.

In terms of solving problems, we can look at hardware solutions and software solutions; let's start with software. First of all, most trading happens with NIC cards, and with Linux, if you take a NIC card without any tuning, you get typical round-trip times of 15 to 20 microseconds (or half that one way) through an unoptimized kernel. So if you don't do anything and just put a NIC in your host, those are the kinds of round-trip times you get. If you use an optimized TCP offload you can get the numbers down to about 2.9 microseconds transmit and 6 microseconds receive. You can do better in software: with kernel-bypass techniques like the Datagram Bypass Layer you can get down to about three and a half microseconds for UDP and four microseconds for TCP, and of course there's also InfiniBand with MPI, which can get down to microsecond-level times, although that excludes the application layer; you still have to add time for the processing it takes to actually perform the operation you need to do.

On the hardware side there are graphics processors, GPUs, but GPUs are really optimized for throughput: they are big, long, deep vector machines optimized to push a lot of numbers through, so they are not well optimized for latency, and in addition you have to go across the PCI Express bus to talk to them. Because of that combination, GPUs are used on Wall Street, but they are not used for trading. There are also ASICs, and of course ASICs achieve sub-microsecond latency, but the problem is that they don't have the flexibility to handle different protocols and features.
How Wall Street trades, and which trades work, changes on a daily or weekly basis, so it would be difficult to build a business case around building ASICs for Wall Street: you would obsolete your chip before it got fabricated. That doesn't work well. FPGAs have the benefit, much like ASICs, that you can get the latency down, to around 0.2 microseconds for processing TCP packets, for example, and you also have the flexibility to change and support different protocols and features. So FPGAs seem to be a pretty promising technique for trading on Wall Street; it's a technique that almost all the firms are now using. In the last two or three years every company on Wall Street has started using FPGAs, and we've been involved with them through that process.

On the FPGA side, there are a few ways you can do FPGA computing. One is what you might call the hybrid computing model: a combination of a CPU plus a NIC plus an FPGA offload engine. Here the packets come in, go through the FPGA, and are passed up through a NIC to go through the driver and the OS (or sometimes past the OS, because you might have an OS bypass) and then make their way up to the financial application. The issues with hybrid computing are that you sometimes get unpredictable PCI Express bus transfer times, because other transactions may be contending for the PCIe bus, you may have a memory copy, you may have a cache miss, and you also run into Amdahl's law: although the FPGA may give you some parallelism, the sequential execution can dominate the total execution time.

Another model is the pure FPGA computing model: what if you could do everything on the FPGA? That way you avoid the bus copy, avoid the memory copy, have no cache misses, and everything is parallel execution. As a way of thinking about this, look at a chart of processing time versus the number of messages: basically a distribution plot that shows two things, latency and jitter. The latency is how much time it takes to process a message. With software, for example, you might have an average of, say, a 5 microsecond delay for processing a packet, but you also have some jitter, which means sometimes it's a little faster, like 2 microseconds, and sometimes a little slower, like 7 microseconds; the spread, the width of that curve, is the jitter. With hardware you can get the latency down to about two-tenths of a microsecond, 200 nanoseconds, by going through a few clock cycles of pipeline stages, and you effectively have no jitter. The only jitter you have is which phase of the clock the packet arrived on: if you're running at, say, 156 MHz, one clock period is about 6 nanoseconds, so you might have a 6 nanosecond jitter depending on whether the packet came across the asynchronous link just before or just after the clock edge. Otherwise it's deterministic, so you get a very tight bound on the latency, and you start off with a small number.
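As a rough illustration of the distribution plot described above, a short sketch can summarize a set of per-message processing times by their mean (the latency) and their spread (the jitter). The sample values here are hypothetical microsecond timings, not measurements from the talk.

```cpp
// Summarize a latency distribution the way the slide describes it: the mean of
// the samples is the latency, the spread of the samples is the jitter.
// The sample values below are hypothetical, not data from the talk.
#include <algorithm>
#include <iostream>
#include <numeric>
#include <vector>

int main() {
    std::vector<double> samples_us = {5.1, 4.8, 2.3, 6.9, 5.0,
                                      5.2, 4.7, 7.1, 5.0, 4.9};

    double mean = std::accumulate(samples_us.begin(), samples_us.end(), 0.0)
                  / samples_us.size();
    auto [lo, hi] = std::minmax_element(samples_us.begin(), samples_us.end());

    std::cout << "mean latency: " << mean << " us\n"
              << "jitter (max - min spread): " << (*hi - *lo) << " us\n";
}
```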
The big problem with FPGAs, though, shows up if you look at the latency you can achieve versus how long it takes to get your product to market. This is a chart of development time to market, measured in hours, days, weeks, months, and years on a logarithmic scale, against latency. A clear benefit of software solutions is that it takes less time to get started: within a matter of hours you can be up with gcc, make, compile, run, and you can be running applications in software. The difficulty is getting the latency down. You can go back and optimize your code, crank up the compiler optimization flags, use a better software algorithm, tune it and tweak it, but you hit a point of diminishing returns after a while, which means that no matter how much you keep banging on your code it's not going to get any faster; even if you spend a year or two on it, you're still only slightly faster than you were before. The problem FPGAs have is that it generally takes a long time to get your first FPGA circuit working, so it may be weeks or months or years before you have your application running on the FPGA, but once you get there you start off with a lot less latency. You have this inherent advantage of being faster; it just took you longer to get to market, and you may be out of the market by the time you get the device. Traditionally, a lot of companies that tried doing this and didn't get to market in time lost out, because someone else was eating their lunch by trading with software before them.

So we put together a library to get to market faster; we call it the Low Latency Library. It consists of a few pieces: some infrastructure, some protocol parsers, and market data kept in local memory. A block diagram of the whole system is here, and we'll go through the pieces.

Starting with the infrastructure: it includes the 10 Gigabit Ethernet MACs, which are Xilinx components, but our part is all of the IP packet processing, the session and datagram processing, plus an interface so that you can control and configure the FPGA from software. We still have the user, in software, setting up the connections and configuring which flows and sessions belong to which traders. There is a C++ API that you can write to; it sends messages down through a standard interface, and they get read into the register interface, so we can set up, control, and configure the device in a pretty standard way.

Then there are the protocol parsers. The protocol you speak on a stock exchange depends on which exchange you are on. The most common exchange language is what's called FIX, the Financial Information eXchange; that's a text-based protocol where you lay out what your trade is with text fields, field=value tuples or key-value pairs. But there are also a number of binary protocols: on Nasdaq they use what's called OUCH, XPRS is used for Direct Edge, BATS has its own protocol, there's ArcaDirect for NYSE, and in London it's the London Stock Exchange's protocol. I'd say our biggest contribution is that we've set up protocol parsers that handle all of the major exchanges, so that we can parse out, extract, and make decisions on stock orders in a number of different protocol formats as they come in, all of it being done in the FPGA hardware.
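To make the FIX wire format concrete, here is a minimal software sketch of tag=value parsing. It is purely illustrative: the function name and field selection are assumptions of mine, not part of Algo-Logic's API, and the library described in the talk performs the equivalent extraction in FPGA logic rather than in C++.

```cpp
// Illustrative FIX parsing in software. FIX fields are "tag=value" pairs
// separated by the SOH (0x01) character; tags 35/55/54/38/44 are the standard
// FIX tags for MsgType, Symbol, Side, OrderQty and Price.
#include <iostream>
#include <map>
#include <string>

std::map<int, std::string> parse_fix(const std::string& msg) {
    std::map<int, std::string> fields;
    size_t pos = 0;
    while (pos < msg.size()) {
        size_t eq  = msg.find('=', pos);
        size_t soh = msg.find('\x01', pos);
        if (eq == std::string::npos || soh == std::string::npos) break;
        fields[std::stoi(msg.substr(pos, eq - pos))] =
            msg.substr(eq + 1, soh - eq - 1);
        pos = soh + 1;
    }
    return fields;
}

int main() {
    // A simplified FIX 4.2 new-order-single: buy 100 shares of AAPL at 42.00.
    std::string raw = "8=FIX.4.2\x01" "35=D\x01" "55=AAPL\x01"
                      "54=1\x01" "38=100\x01" "44=42.00\x01";
    auto f = parse_fix(raw);
    std::cout << "symbol=" << f[55] << " side=" << f[54]
              << " qty=" << f[38] << " price=" << f[44] << "\n";
}
```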
As a graphic view of what this looks like: if you look inside the FPGA, with a ModelSim waveform or a simulator waveform, you would see an OUCH packet. This is a Nasdaq order coming in that has a 14-byte order token, a buy/sell indicator, the number of shares, the stock symbol, and a price. What we see on the FPGA are clock cycles on a 64-bit bus: one, two, three clock cycles, so as the packet streams through we are just picking off the fields that make up the order and then making a decision on what to do based on them.

The last piece is the stock market price tables. Sometimes the decision you have to make depends on what price the stock is trading at, so we maintain local memory on the FPGA: block RAM that is used to store stock price tables in a compact cache format. Market data updates, which also arrive over the network, can be populated into this table.

The reason for doing this, for building a pre-built library that has all the parsing, the infrastructure, and the market data components, was to move that time-to-market line to the left. That's really all we did: with this pre-built library, instead of taking weeks or months or years to get a product to market, a new customer can get there in days or weeks, but it still has the benefit of the FPGA solution, meaning the low-latency data path. From that starting point we begin at the green point on the chart: still a slightly longer time to market than you could get with software, but lower latency in general, and a more consistent latency.

We've been working with different areas of applications. The people that use this are traders, brokers, market makers, and exchanges. Some of the applications of having this mapped into FPGA hardware are things like trading strategy, which is the algorithmic trading itself, feed processing, and risk management. Risk management means making sure that the orders going into the stock market are sane: that the orders are within the ranges they should be, that a customer is not overextending, and that some errant software process running on a machine isn't injecting bogus orders. You can stop that just by having the extra 200 nanoseconds of wait to verify the orders, and you could have made sure that the nightmare never happened, because you could have been checking that you weren't putting silly orders in. Of course you could also simply not have run the software script in the first place, but if you had this extra risk-check process in the path, you could have made sure that the errant process didn't bring down the company. So we think that for people doing compliance, or trying to mitigate the threat that your IT person brings your company down on a Monday morning, you could have an extra system, an extra FPGA card, a little NetFPGA-10G card, just checking to make sure that crazy things aren't happening. Other applications are smart order routing, crossing and matching, and internalization; there are a number of different things that might be useful.

To show the operations we do in hardware, we put together a demo: we parse incoming FIX execution reports and maintain, on a per-security basis and on a per-flow, per-session basis, the long exposure, the short exposure, the value per security across all sessions, and the position per security across all sessions.
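As a rough software model of that per-security bookkeeping (a sketch only: the class, field names, and the one-million-dollar limit are all made up for illustration, and the deployed system keeps these tables in FPGA block RAM and updates them at line rate), one might track position and long/short exposure per symbol and flag a breach like this:

```cpp
// Toy per-symbol position and exposure tracker mirroring the dashboard idea
// from the talk. All names and the exposure limit are hypothetical.
#include <iostream>
#include <string>
#include <unordered_map>

struct SecurityRisk {
    long   position = 0;        // net shares held (signed)
    double long_exposure = 0;   // dollars committed on the buy side
    double short_exposure = 0;  // dollars committed on the sell side
};

class ExposureTracker {
public:
    // Called for every execution report that passes through the card.
    // Returns false if the per-symbol exposure limit is breached.
    bool on_fill(const std::string& symbol, char side, long qty, double px) {
        SecurityRisk& r = book_[symbol];
        if (side == 'B') { r.position += qty; r.long_exposure  += qty * px; }
        else             { r.position -= qty; r.short_exposure += qty * px; }
        return r.long_exposure < limit_ && r.short_exposure < limit_;
    }
private:
    std::unordered_map<std::string, SecurityRisk> book_;
    double limit_ = 1000000.0;  // made-up per-symbol exposure limit
};

int main() {
    ExposureTracker t;
    if (!t.on_fill("AAPL", 'B', 50000, 42.0))   // 50,000 shares * $42 > $1M
        std::cout << "exposure limit breached on AAPL\n";
}
```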
This bookkeeping is done by having all of these trades pass through the NetFPGA-10G card and then, on a periodic basis, showing a dashboard to the end customer of what's going on. For example, this is the dashboard that runs on the GUI. It's showing a group of algorithmic traders at a firm who are trading; all of their FIX orders go through to be executed on the exchange, but they pass through this card, and as they pass through, the card looks at what exposure they are putting the company at risk for. Had Knight Capital had a dashboard like this, they would have seen within a second, or two seconds, or five (it takes a little bit longer for the GUI to update) that their exposure and positions were highly out of whack. In this setup, we think it would be beneficial for a trading firm to be monitoring the risk instead of letting it go unabated for 45 minutes, until 45 minutes later they realize they have lost roughly $440 million.

Some specifications and performance numbers: again, this implementation was on the NetFPGA-10G, a mundane card by today's performance requirements; it was something we did a couple of years ago back at Stanford. The application was this position and exposure calculation. The protocol was FIX 4.2, with a small number of sessions (ten) and just a hundred securities, which was the limit of what we could show on the GUI graph; internally we have also done implementations with tables that contain every traded equity symbol on the market. The clock frequency supports line rate, ten gigabits per second. Our processing latency is 200 nanoseconds; that's our logic extracting the messages. There is also a 10-gigabit PHY delay, and the PHYs are not as well optimized as they could be, so looking at pin to pin, which is a common metric, it's about a microsecond of pin-to-pin latency.

The results are that we implemented this FPGA gateware to do order-flow processing; it parses the protocols of all the major exchanges and maintains the market data in local memory. We demonstrated it with ten sessions of FIX 4.2 on the NetFPGA-10G and achieved effectively jitter-free performance: six nanoseconds of jitter, with a 200 nanosecond latency for processing an order.

For more details, there is a lot more on the website about which protocols are supported and what it does. We're also here to talk with people, and our office is actually located just three blocks away; we relocated from Palo Alto a year ago, so we're just on the other side of Nvidia. In our lab we have, for example, our ten-gigabit stock trading setup, from which we can simulate and test market conditions, place orders, and run verification scripts. We take machines and do full system testing before they get deployed onto Wall Street, and then just drop-ship the boxes via UPS to the data centers in New Jersey. So that's it, thank you.
Info
Channel: InsideHPC Report
Views: 21,830
Rating: 4.9676113 out of 5
Keywords: hpc, supercomputing, low latency, high frequency trading, FPGAs, accelerators, Hot Interconnects, insideHPC
Id: nXFcM1pGOIE
Length: 22min 31sec (1351 seconds)
Published: Wed Aug 22 2012