- [Tom] Hello everybody, I'm Tom Poole and today I'm gonna talk
to you a bit about some alternate std function implementations. So the main motivation for this work, or at least the main motivation for this talk, is some work I did a while ago, which was adding a std function replacement to the JUCE framework, which is an open-source, cross-platform effort. I'm not just saying that to name drop the product that I'm working on: there are a couple of aspects of JUCE that mean that a std function replacement is kind of uniquely valuable. These two things are that JUCE has quite broad cross-platform compatibility requirements, and that it's focused on audio applications. So we'll get to those two in a minute, but first it's worth just giving a quick std function recap. So in a single sentence, std
function is a type-erasing wrapper that holds
something that is callable with a specific signature
and so the things that are callable in C++
are function pointers, objects with a call operator,
and lambda functions. There's just a quick
example of each of these on the left hand side. We have a function, we have some struct which has a call operator so we can call it, and down at the bottom,
we have a lambda function. Each of these can be
assigned to a std function and that's good because
we can treat them all on an equal footing.
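Roughly the kind of thing that was on that slide (a reconstruction, not the exact slide code):

```cpp
#include <functional>
#include <iostream>

// A free function...
int addOne (int x) { return x + 1; }

// ...an object with a call operator...
struct AddTwo
{
    int operator() (int x) const { return x + 2; }
};

int main()
{
    // ...and a lambda can all be stored in the same std::function,
    // because each of them is callable with the signature int (int).
    std::function<int (int)> f;

    f = addOne;                          // function pointer
    std::cout << f (1) << '\n';          // 2

    f = AddTwo();                        // object with a call operator
    std::cout << f (1) << '\n';          // 3

    f = [] (int x) { return x + 3; };    // lambda
    std::cout << f (1) << '\n';          // 4
}
```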
So why is that useful? I think it's fair to say that std function really has changed the way a lot of programs are written, not programs in their entirety but certainly bits of them. If you're interacting with
anything that is asynchronous, before the advent of std function, the typical approach is that
you'd call an asynchronous function and you'd have to provide it with a function pointer and
maybe some void star data, to then get control back
to where you started from. And the basic process would be: you'd save your this pointer into the void star data, then when you get a callback, because your asynchronous function has either finished or hit an error or something, you then cast this void star pointer back into an instance of yourself, and you're back where you started from. You can monkey around with all of your internal data again. With std function, you can simply put this in the capture list and everything is much more contained. You don't have these global functions littering your code base where the only point of them is to get you back where you started from.
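To make that concrete, here is a rough sketch of the two styles side by side (the startAsyncWork API is invented purely for illustration):

```cpp
#include <functional>

// A stand-in for some asynchronous API: these names are made up for the
// example, and a real API would call you back later rather than immediately.
void startAsyncWork (void (*callback) (void*), void* userData) { callback (userData); }
void startAsyncWork (std::function<void()> callback)           { callback(); }

class Widget
{
public:
    void beginWork()
    {
        // Old style: stash `this` in a void*, and get it back in a free function.
        startAsyncWork (&Widget::workFinishedCallback, this);

        // std::function style: capture `this` in the lambda and stay in one place.
        startAsyncWork ([this] { workFinished(); });
    }

private:
    static void workFinishedCallback (void* userData)
    {
        // Cast the void* back into an instance of ourselves...
        static_cast<Widget*> (userData)->workFinished();
    }

    void workFinished() {}  // ...and we're back where we started, free to use our internals
};
```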
And std function is also a really good fit for a lot of audio processing operations, which brings us back to what I actually used it for in JUCE. So you can think of almost
all audio effects as being successive transformations
on arrays of audio samples. So each of these transformations
has the same signature. You have some input samples,
you do something with them and then your output is the output samples. And because all of these transformations have the same signature, swappable arrays of callable objects are a really convenient structure that you can use to have a way of changing behavior dynamically. Or another example would be something like an oscillator class. An oscillator class does something periodic, and the user can provide a std function to determine what the wave function would look like in that periodic bit, which is then swappable at runtime.
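As a sketch of both of those ideas (illustrative only, these are not the actual JUCE classes):

```cpp
#include <cmath>
#include <functional>
#include <utility>
#include <vector>

// One possible convention for a "transformation": read and write a block of samples.
using Transform = std::function<void (float* samples, int numSamples)>;

// A swappable chain of processing stages, applied in order.
void process (const std::vector<Transform>& chain, float* samples, int numSamples)
{
    for (const auto& stage : chain)
        stage (samples, numSamples);
}

// A toy oscillator whose wave shape is a user-supplied std::function,
// mapping a phase in the range [0, 2*pi) to an output sample.
class Oscillator
{
public:
    void setWaveFunction (std::function<float (float)> fn) { waveFunction = std::move (fn); }

    float getNextSample()
    {
        auto sample = waveFunction (phase);
        phase = std::fmod (phase + increment, 6.28318531f);
        return sample;
    }

private:
    std::function<float (float)> waveFunction = [] (float x) { return std::sin (x); };
    float phase = 0.0f, increment = 0.01f;
};
```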
So std function is really useful, and people who use JUCE have been using it to get things done, and this is important because JUCE supports a lot of systems, and
as you get really old, on the desktop side, or really
weird on the embedded side, then you run into
situations where the standard library lacking a std function is a reasonably common occurrence, particularly in the tool chains for embedded systems: you might lack a std library altogether, or it would be made by
someone who is much more concerned with constraining
it in terms of size, rather than having all the
functionality available. And also if you want to
target OS X older than 10.7, you'll find that you are
free to use a really modern compiler but you are limited
to the standard library that's available on that platform which gives us most of
C++ 11, but std function is the real standout that
the JUCE users really wanted. And the other aspect of why std function is particularly important when you're working with JUCE is that JUCE works with audio, and processing live audio is extremely unforgiving. So the process of dealing
with audio starts out with some communication with the audio device in your system: you tell it, I want to do something with audio, and you do that by supplying it with a callback. This callback has, at least in its arguments, a pointer to some memory and the size. Once you've told the audio device about this, at some point in the future it's going to call that callback. And once it's done that, you have a short window of time where you fill that block of memory with some audio data, and then a short time after that, that memory will be read by the sound card.
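The shape of that callback is roughly this (a generic sketch, not any particular device's API):

```cpp
// A generic sketch of the kind of callback an audio device asks for: it hands
// you a block of memory and a size, and you have to fill it before the deadline.
void audioDeviceCallback (float* outputBuffer, int numSamples)
{
    for (int i = 0; i < numSamples; ++i)
        outputBuffer[i] = 0.0f;   // every sample you fail to write in time is heard as a glitch
}
```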
The crucial thing here is that if you don't finish writing to that memory, you will hear a glitch. This is not really like
graphics applications where you're just blasting
frames to the screen. If you drop a frame, unless
you're looking out for it, you probably won't notice but
if you drop a single sample, you will hear it and
it will sound terrible. And so why is that bad? Well if your audio
software produces glitches, then people will not buy it
and this is not particularly because musicians are fickle
or anything like that, but if you're broadcasting
live, or even doing something quasi-live like recording
a once-in-a-lifetime take, if your audio glitches during that, the whole recording is effectively ruined. And even worse, if you are doing your bit of audio processing towards
the end of a signal chain on a really loud sound system, then you can actually cause
permanent hearing damage to anyone listening to it. So because of that, there
really is one golden rule of audio programming and
it is, if you're working on the audio thread,
you do not do anything where you don't know ahead of time exactly how long it's going to take. And if you're looking
at audio thread code, two things stand out as big red flags, the first is any locks and the second is any memory allocation. So locks are bad because,
well on this MacBook at least, the audio thread has a higher
priority than any other, so as a programmer, you
cannot create another thread that has the same priority. If you're locking, it is almost certainly because you want to
communicate with another thread, and as soon as that lock is held by a lower priority thread, you have effectively demoted the priority of the audio thread to that of the thread that is sharing the lock,
and as soon as you do that, the OS can just decide not
to schedule that thread for whatever reason and
suddenly you miss your deadline and you hear a glitch. Memory allocations are
bad, first and foremost, because they take quite a long time. We're generally working in a
very time-constrained domain. We want everything to
happen as fast as possible. The really nasty thing about
them is that they can take an unpredictable amount of time, so every now and again,
your fetch from main memory will take much longer than it used to and this again can push
you over your deadline and you hear a glitch. Why is this a problem when you're working with a std function? Well, std function allocates sometimes. We'll talk about this a
little later in the talk but std function can avoid
allocations by using a small bit of stack space in each object, it's called small buffer optimization and the same thing happens in std string. So if the thing you are
storing in your std function is smaller than the stack
space, then the stack space would be used and you
wouldn't have an allocation. If it's bigger than that stack space, then we need to fetch some memory from the heap to hold it. And the annoying thing is that
the size of the stack space is implementation defined so
short of going and reading the code in the standard
library that you're using, there's no way to tell ahead of time what this stack size is going to be. And if you're going to compile the same code with different compilers, then all bets are off.
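For example, something like this might stay in the small buffer with one standard library and allocate with another; where the threshold falls is exactly the implementation-defined part (a sketch):

```cpp
#include <array>
#include <functional>

int main()
{
    // Capturing a single reference is small: most implementations will keep
    // this inside the std::function object's own small buffer.
    int value = 42;
    std::function<int()> small = [&value] { return value; };

    // A big by-value capture cannot fit in the small buffer, so constructing
    // or assigning this one will go to the heap.
    std::array<float, 1024> bigBlock {};
    std::function<float()> big = [bigBlock] { return bigBlock[0]; };

    return small() + static_cast<int> (big());
}
```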
We are not the only people worried about these concerns, so the GameDev and low latency working group (SG14) has something called inplace function. So inplace function is a std function alternative that is guaranteed never to allocate. It essentially uses this small buffer for everything, and that's the direction that this talk is going to go in. But also, whilst we're
kicking std function, Facebook's Folly library has a couple of other improvements. First off, it is a move-only alternative and I think this makes a lot of sense. If you think about how
you use std function, it is almost always to
define some functionality in one place and then pass
it off somewhere else. If you're just going to
define something and use it, you can use a lambda function, which would be much more efficient, and it's very rare that
you really want to copy the functionality that you've defined. And the other thing it also
addresses is the fact that it is very easy to
violate const-correctness using std function. So a std function's call
operator is marked as const, so if someone gives you a const
reference to a std function, you can invoke something. However, once you've invoked that something, that something is free to modify its own internals, and as soon as you do that, all guarantees about const-correctness are out of the window. Your compiler will not be able to help you.
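Here is a small example of that problem (a sketch):

```cpp
#include <functional>
#include <iostream>

void callIt (const std::function<int()>& fn)
{
    // fn is a const reference, but std::function's call operator is const,
    // so we can invoke it, and the thing inside is free to mutate itself.
    std::cout << fn() << '\n';   // prints 0
    std::cout << fn() << '\n';   // prints 1: the state behind the const reference has changed
}

int main()
{
    int start = 0;
    std::function<int()> counter = [start]() mutable { return start++; };
    callIt (counter);
}
```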
So let's dive in. So what do we need if we're going to implement a std function? At the very high level, we have a function class that is a template, and once we know the types of the things that it takes as arguments and the type of the thing that it produces as a result, we can specialize that template. Note that here we don't have any information about the type of the thing that this function is going to hold, it's just the signature. So we're going to do some type-erasing.
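At the very top level, the shape is something like this (the names are mine, not the ones from the slides):

```cpp
// Primary template: left undefined, it exists only so that we can specialize it.
template <typename Signature>
class function;

// Partial specialization on the call signature. At this point we know the
// result type and the argument types, but nothing about the type we will hold.
template <typename Result, typename... Args>
class function<Result (Args...)>
{
public:
    template <typename Functor>
    function (Functor f);                   // the held type is only known in here

    Result operator() (Args... args) const;

    // ... type-erased storage and bookkeeping will go here ...
};
```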
This is the approach that lib std C++ uses, and it's a set of type-erased function pointers. So on the left hand side here, you can see that we have some type-erased storage, just an array of characters, and we have some function pointers, where the types of all these function pointers only involve void, void stars, the result type and your argument types. And in tandem with this, you have a set of type-aware functions. Now these functions are type-aware because they have a template type, and this functor is going to be the type of the thing that we are storing. And so because we know this type, we can invoke the thing, we can copy construct this thing into a specific destination, and we can destroy it. So the constructor is where all the action happens, because this is the only place in the entire class where we actually know the type of the thing that we are going to wrap, and once we know that, we produce some specializations of each of those type-aware functions. When the compiler sees this, it's going to go and stamp those out somewhere. If we've got a pointer to it, we simply cast this type-aware pointer down into our type-erased pointers, and then we're free to use them. So in the constructor, we give it the address of the storage we're using, same with the destructor, and same with the invoke. These are all just operating on type-aware memory: these type-erased pointers are actually pointing at type-aware functions.
So that's the nuts and bolts of how this thing actually operates, but to make something useful, we want to put these things in containers. There's no point making a type-erased thing if we can't treat all of the things that we contain on the same footing. So for that, there's not too much that's interesting here. We just note that we now need to know the storage size as well, because if we're assigning from a different function, we need to know the size of the thing that it contains, so we can make our internal storage large enough to take it. So we've done all that. We want to know how this
thing is going to perform so we've got a very simple benchmark here templated on this function type. This function type is gonna
first be a std function and then it's going to be our replacement. We have an array of these
things in a default state. Each one of them, we're
going to assign it to be a function pointer and then
we're just going to loop over them all again and produce a sum. And so, we see how we did: pretty terribly. So this guy in the greeny-blue is std function, and the other side is our alternative. This shouldn't be too much of a surprise. If we jump back a couple of slides, then we can see every time we're calling a constructor or every time we're doing an assignment, we're having to reallocate our internal storage, and calling out to main memory is very slow, so the fact that std function is beating us comprehensively isn't much of a surprise. So how it's doing this is, as I previously mentioned, if the thing that you are storing is small, and in this case we're just storing a function pointer, then std function is free to avoid an allocation by storing this directly in some stack space in itself. It's annoying because this is a non-configurable, implementation-defined size,
but if we're going to be building a replacement for std function, we want something that operates similarly so our users are not going to be surprised and it's pretty easy to add in. Don't worry about reading
most of that, it's not particularly interesting, the
interesting bits are down here where we have some stack space. If you haven't seen std
aligned storage before, it's essentially a very thin wrapper around an array of characters. It's nicer than an array of characters because it has some sane defaults for alignment on the platform that you're running on. That makes more semantic sense: you don't need an array of characters, you're not gonna print it, you're not gonna read it. This gives you a more generic storage type which just more clearly indicates, this is just an area of stack that we want to use. I have an extra storage pointer here, and this storage pointer is either going to be null for the default state, it's going to point to the heap if we're exceeding our stack size, or it's going to point into ourselves, to the stack. And so you see in the constructor, if the size of our functor is small enough, we use our internal storage, and if it's too big, we use our external storage. And you can see on the right hand side that there's nothing difficult going on here, but you can see the increase in complexity of this function when we have to decide where we're going to store our internals. This could be DRYed up, so this is essentially the destructor and you could write this in terms of a swap function, but I just wanted to really list all the operations that you would need to do to make this work.
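The storage-related part looks something like this (a sketch of the idea rather than the real JUCE code; the rest of the machinery is elided):

```cpp
#include <cstddef>
#include <new>
#include <type_traits>
#include <utility>

template <typename Signature> class function;

template <typename Result, typename... Args>
class function<Result (Args...)>
{
public:
    template <typename Functor>
    function (Functor f)
    {
        // Decide where the functor is going to live.
        if (sizeof (Functor) <= stackSize)
            storage = &stack;                         // small enough: use our own stack space
        else
            storage = new char[sizeof (Functor)];     // too big: go to the heap

        new (storage) Functor (std::move (f));        // construct it in place

        // ... the type-erased invoke/copy/destroy pointers are set up as before ...
    }

    // (the destructor, copy constructor and assignment all need the same
    //  stack-or-heap branching, which is where the extra complexity comes from)

private:
    static constexpr std::size_t stackSize = 32;      // an arbitrary size for the sketch

    // Correctly-aligned raw storage: nicer than a plain char array because it
    // says "this is just a region of stack we want to use" and gets sane alignment.
    typename std::aligned_storage<stackSize, alignof (std::max_align_t)>::type stack;

    // Null in the default state, pointing into `stack`, or pointing at the heap.
    void* storage = nullptr;
};
```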
So if we benchmark this against std function, then we're doing more or less all right. We're a little bit slower, but we haven't put any effort into optimizing this whatsoever, so again that is not surprising at all. So if we look at the size of
our alternative implementation, we see we're pretty chunky. We have three function pointers,
we've got some stack space, we've got an int for the heap size and we have an extra pointer as well. Now if you look at the lib
std C++ implementation, it deals with this in
a much more clever way. So the invoke pointer is
in a union with the stack; the create pointer, destroy pointer and some other functionality are all bundled into a type-erased management function, which does all of those operations. It just takes an extra argument, which is effectively an op code telling it what it needs to do, and the heap size and the storage pointer are abstracted away.
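The shape of that trick is roughly the following (a paraphrase of the idea, not the actual libstdc++ source):

```cpp
#include <cstddef>
#include <new>

// One type-erased "manager" replaces the separate create/destroy/query pointers:
// a single function pointer is stored, and an op code says what to do.
enum class Op { clone, destroy, getSize };

template <typename Functor>
void manager (Op op, void* dst, const void* src)
{
    switch (op)
    {
        case Op::clone:   new (dst) Functor (*static_cast<const Functor*> (src)); break;
        case Op::destroy: static_cast<Functor*> (dst)->~Functor();                break;
        case Op::getSize: *static_cast<std::size_t*> (dst) = sizeof (Functor);    break;
    }
}
```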
But we're not gonna do that. We're just going to take a step back. If we look at what this code is doing, we have effectively written our own v-table. A much nicer way to deal with this is just to use a v-table. So this is how most people are more comfortable with doing type erasure. On the left hand side, we have a type-erased base class which is doing a similar kind of thing: we have an invocation operator and we have a couple of methods for copying memory around. And we have a derived class that actually holds a concrete implementation of the thing that we are trying to wrap, this functor here. Dealing with it is very easy: we can just call it, or we'll copy construct it into certain places. And actually the internals of our overall function wrapper are then much simpler. We just have some stack space and we have this base class function pointer, which again is either going to point onto the heap or into ourselves.
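In outline, that looks something like this (a sketch with invented names):

```cpp
#include <new>
#include <utility>

template <typename Signature> struct callable_base;

// The type-erased base class: a real vtable instead of a hand-rolled one.
template <typename Result, typename... Args>
struct callable_base<Result (Args...)>
{
    virtual ~callable_base() {}
    virtual Result operator() (Args... args) = 0;
    virtual void copyTo (void* destination) const = 0;   // copy-construct a clone at an address
};

// The derived class actually holds a concrete copy of the thing we're wrapping.
template <typename Functor, typename Result, typename... Args>
struct callable : callable_base<Result (Args...)>
{
    explicit callable (Functor f) : functor (std::move (f)) {}

    Result operator() (Args... args) override
    {
        return functor (std::forward<Args> (args)...);
    }

    void copyTo (void* destination) const override
    {
        new (destination) callable (*this);
    }

    Functor functor;
};

// The function wrapper itself then just holds some aligned stack space plus a
// callable_base<Result (Args...)>* that points either into that space or at the heap.
```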
And if we do this, then we effectively recover, with no extra effort, exactly the same performance as std function. This is the std function from libc++, if anyone's interested. So now we have a std function replacement that is working at the same speed. What are we gonna do with it? So this is the point where
someone may throw a spear out of the audience and I have to dodge it because adding declarations
to the std namespace is undefined behavior
but we've done it anyway. I know, I can hear that. (laughs) So this is dangerous because
there's an unfortunate coupling between the std
namespace and the compilers. The compiler is free to look inside there, and if it sees something called std function, it at least reserves the
right to make assumptions about the memory layout or how
you operate with this thing. So we have taken some precautions, so we have a whole sea
of pre-processor checks to make sure that we are
running on a platform that doesn't already have a std function. Some other things that
work in our favor are that JUCE is distributed as source code, so whilst it's perfectly possible to compile JUCE as something like a DLL
and try and pass one of these things into it or out of it, if you're on a system
which doesn't otherwise know about std function,
it's very difficult to know what you're going to do
with it on the other side. And because of this,
it's been out in the wild for a couple of years and
we've had no ABI issues or any others that we know about. So that's all well and good
but it hasn't really solved the problem of this allocation
and being safe to use on an audio thread and
this is the simplest way of solving that problem. All we do is when we
construct one of these things, if the size of the thing
that we want to wrap is greater than the size of the stack, we would just fail to compile. This gives us a real cast-iron guarantee that when you use one of these things, it's not going to allocate. And another benefit of doing this is that the internals of all our things, like copy constructors or assignments, become much simpler, because we don't need to worry about where our internal functor is stored, like it could be on the heap or it could be on the stack: now it is always on the stack, so there's much less branching.
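The whole trick is essentially one static_assert in the constructor (a sketch):

```cpp
#include <cstddef>
#include <new>
#include <type_traits>
#include <utility>

template <typename Signature> class function;

template <typename Result, typename... Args>
class function<Result (Args...)>
{
public:
    template <typename Functor>
    function (Functor f)
    {
        // Refuse to compile anything that would not fit into the internal stack
        // space: a compile-time guarantee that this wrapper never allocates.
        static_assert (sizeof (Functor) <= sizeof (stack),
                       "this callable is too big to be stored without allocating");

        new (&stack) Functor (std::move (f));
        // ... the invoke/destroy machinery is set up as before ...
    }

private:
    typename std::aligned_storage<32, alignof (std::max_align_t)>::type stack;
};
```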
What this translates to is that actually we can get a pretty significant speed up over std function. So again, this is a pretty targeted benchmark, so you are unlikely to see a 30% speed increase just from switching to this, but nevertheless, it's notable. So we can do that and we can get a bit of extra runtime performance, but in terms of our memory management, we are now doing terribly, so there's a pretty good argument for making your std function replacements have a total size of either 32 or 64 bytes so it lines up nicely with the cache lines on typical systems that JUCE users use. But if you do that, particularly if you make it 64 bytes and you only store
function pointers into it, most of that stack space
is going to be unused. So the typical way of getting out of this jam is something that std vector uses: you just add the ability to plug an allocator into things. And when you do that, you effectively delegate your allocation behavior to the allocator. So pre-C++ 11 allocators were extremely awkward. They got a little bit better in C++ 11 and even better ones will be available in C++ 17, but we are firmly anchored to C++ 11, and even though they have improved, using them is really rather cumbersome. And actually, if you look around at a lot of audio code, it's very rare that you see anyone doing anything with custom allocators, but if you were, the general strategy would be that your allocator would allocate a big chunk of memory up front in a non-performance-critical bit of your application, and then rather than dialing out to main memory when you need a chunk, you just dip into this pre-allocated pool. There's a slight performance overhead, but it is nowhere near the cost of dialing out to main memory.
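A minimal version of that strategy looks something like this (illustrative only; a real pool would need alignment, thread-safety and deallocation):

```cpp
#include <cstddef>
#include <vector>

// A very small memory pool: grab one big block up front, in a part of the
// program that isn't performance critical, then hand out chunks of it later
// without ever going back to the heap.
class Pool
{
public:
    explicit Pool (std::size_t bytes) : buffer (bytes) {}

    void* allocate (std::size_t bytes)
    {
        // This sketch just bumps an offset through the pre-allocated block.
        if (used + bytes > buffer.size())
            return nullptr;

        void* result = buffer.data() + used;
        used += bytes;
        return result;
    }

private:
    std::vector<char> buffer;
    std::size_t used = 0;
};
```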
A much better approach was presented by Louis Dionne at last year's CppCon talk, where he was advocating really the ability to use things like the small buffer optimization in almost all C++ objects. It's a much more comprehensive approach, it's super powerful, and if you've not watched the talk, you definitely should because
it is really inspiring. However, you can get most of the benefit of using such an approach
with much less effort and you can do that by just
having more indirection. So now this function class here is a bit of a misnomer, in that you would not be able to use this function in the same ways you would a std function. So what happens here is this is a virtual base class, and all the action, everything that I've shown you, has just been shifted into derived classes. But the crucial thing here is that these derived classes are free to change the signature, so you can have a templated stack size. Now if you know the sizes of all the things that you are going to put in your std function wrappers at the time that you construct them, then you actually have optimal memory management this way. And you can still put all of these things in a list or in a container, because you can use pointers or references to the base class.
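In outline, with invented names, that looks something like this (a sketch; copying and the empty state are omitted):

```cpp
#include <cstddef>
#include <new>
#include <type_traits>
#include <utility>

// The base class carries only the call signature...
template <typename Signature> struct function_base;

template <typename Result, typename... Args>
struct function_base<Result (Args...)>
{
    virtual ~function_base() {}
    virtual Result operator() (Args... args) = 0;
};

// ...and the derived class adds the stack size as a template parameter, so the
// storage can be sized exactly for whatever you intend to put into it.
template <typename Signature, std::size_t stackSize> class sized_function;

template <typename Result, typename... Args, std::size_t stackSize>
class sized_function<Result (Args...), stackSize> : public function_base<Result (Args...)>
{
public:
    template <typename Functor>
    sized_function (Functor f)
        : invokePtr (&invoke<Functor>), destroyPtr (&destroy<Functor>)
    {
        static_assert (sizeof (Functor) <= stackSize, "increase the stack size");
        new (&stack) Functor (std::move (f));
    }

    ~sized_function() { destroyPtr (&stack); }

    Result operator() (Args... args) override
    {
        return invokePtr (&stack, std::forward<Args> (args)...);
    }

private:
    template <typename Functor>
    static Result invoke (void* s, Args... args)
    {
        return (*static_cast<Functor*> (s)) (std::forward<Args> (args)...);
    }

    template <typename Functor>
    static void destroy (void* s) { static_cast<Functor*> (s)->~Functor(); }

    typename std::aligned_storage<stackSize, alignof (std::max_align_t)>::type stack;
    Result (*invokePtr) (void*, Args...);
    void (*destroyPtr) (void*);
};
```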
Benchmarking this against std function is more tricky though, because you can use them in different ways. If you had a container of std functions, you would store the std functions in that container, but with this approach, the lifetimes are slightly different. You'd store the base class pointers in a container, but the actual objects, these things, need to live elsewhere. Nevertheless, if you do shoehorn them into the same benchmark, you will see that the results are a
little ambiguous anyway. Changing the C++ standard,
changing the compiler, changing the standard
library you're comparing to will all produce different results, and the upshot is sometimes
std function is slightly faster than this approach, sometimes this approach is slightly faster. It probably evens out about the same. But with this approach, you are
guaranteed to not allocate. There is however, a fairly
large elephant in the room and that is, what if you just
don't do any type erasure? So this is the world's least
useful function wrapper in that it can only take function pointers and the only reason
that we've created this is to put it into exactly
the same benchmark as everything else went through. And if we do that, it's pretty conclusive. Because the compiler
can see into the type, it can work out that this entire function always returns exactly the same number and we've squashed down
to a single instruction: just return this single value. So there is a pretty massive caveat. If you are doing type erasure, then you are telling the compiler that the thing behind this type erasure, you can't know what that is. It's going to be a runtime property, and that can be a pretty significant performance killer. So the caveat is that you really should only use a type-erased callable wrapper if you really don't know the memory layout of the callables that you need to store, or you really need to store different callables in the same container. There are certain use cases where you need to do that, but if you can avoid it,
then you will probably have a faster program. And a related note is that
if you see a std function in a function signature,
then typically if you instead use a templated type and
then just call that type, your code will run faster. This is because again, the
compiler can see the actual type of the thing that you are using, and so you can get things like inlining. Of course, this does mean that the contract of this function has been blown wide open, because there's no constraint on argument or return types or anything like that, but often the cost of having really horrible error messages when you try to compile is worth it, or at least that's more tolerable than having a slower program.
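For example (a sketch):

```cpp
#include <functional>

// Taking a std::function parameter: convenient and constrained, but opaque to
// the optimizer, and potentially allocating at the call site.
float applyGain (const std::function<float (float)>& transform, float sample)
{
    return transform (sample);
}

// Taking a template parameter and just calling it: the compiler can see the
// real type, so the call can be inlined, at the cost of much uglier errors and
// an unconstrained interface.
template <typename Transform>
float applyGainFast (Transform&& transform, float sample)
{
    return transform (sample);
}

// Both are called in the same way, e.g.
//   applyGainFast ([] (float x) { return x * 0.5f; }, input);
```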
So to conclude, rolling your own version of std function is not particularly difficult. We've talked about all the real pain points. My code is available and
actually if you take a hatchet to your favorite standard library, working out what they do
isn't too difficult either. And once you've done that,
you can replace the internal memory management of std
function with your own scheme and this lets you do things
like enforce realtime safety at compile time by guaranteeing that you're not going to allocate. You can get a modest increase
in runtime performance or you could do something
much more adventurous, but the really important caveat is that, if you're able to restrict
your callable objects to those with a certain
memory layout or type, then do not use a type-erasing wrapper. So that is it, there's a
load of information there and I'll take any questions. (audience applause) - [Man] Do you know how the SG14 inplace function works? Is it the same, similar approach to your-- - I'm really struggling to hear you. - [Man] What? - I'm struggling to hear you. Can you get a bit closer to the mic? - [Man] Sorry. - (laughs) Sorry, I'm really
struggling to hear you. - [Man] Okay, so do you know how the SG14 inplace function works?
much a similar fashion to what we've displayed there. (man speaking without a mic) - Okay, so it's effectively this thing but by itself, without a base class. Oh sorry, and the session
is over, I overran a bit. (audience applause)