THOMAS NATTESTAD: Hi, everyone. My name is Thomas and together
with my colleague Ingvar here, we're going to show you
how using WebAssembly can speed up your
computationally intensive workloads by more than 10x. And how using modern
WebAssembly tooling can let you take advantage
of WebAssembly more easily. We'll start by reminding
everyone what WebAssembly is and showing some
of the improvements we've been making to
Chrome's implementation. Then we're going to get into
some of the different language features that are starting to
ship as part of WebAssembly. And then finally,
we'll close out by covering some of the
new tooling updates that have been coming as well. So let's start by
reminding everyone what WebAssembly actually is. WebAssembly is a new
language for the web that is designed as a
compilation target to offer maximum and
reliable performance. It's important to
remember, though, that WebAssembly is in no way
meant to replace JavaScript. Rather, it's meant to augment
the things that JavaScript was never designed to do. So let's look at some of
the different advantages of WebAssembly and why
you might want to use it. First, because WebAssembly
offers strong type guarantees, it gives you more consistent
and reliable performance than JavaScript. Then, with additional features
like threads and SIMD, which we'll get more into later, you can also achieve speeds that are truly higher than what you can reach with JavaScript. When thinking about comparing
baseline WebAssembly performance to JavaScript,
I find this metaphor, which my colleague Surma
came up with really useful. JavaScript is like
running along a tightrope. It's possible to go fast, but
it requires a lot of skill and it's possible to
fall off the fast path. Whereas baseline WebAssembly is
more like running along a train track. You don't have to be as
careful in order to go fast. Another advantage of WebAssembly
is its amazing portability. Because you can compile
from other languages, you can bring not only your
own code bases and libraries to the web, but also the
incredible wealth of open source libraries built
in languages like C++. Lastly, and potentially
most exciting to many of you out there, is the possibility
of more flexibility when writing for the web. Specifically, the ability
to write in other languages. Since the web's
inception, JavaScript has been the only
fully supported option. And now through WebAssembly,
you get more choice. Most exciting
though, is the fact that WebAssembly is now
shipping in all major browsers, making it the first new language
to ship in every major browser since JavaScript was created
more than 20 years ago. So now that we're all reminded
of what WebAssembly actually is, I want to cover
some of the improvements that we've been making
directly in Chrome. One of the biggest requests that
we've heard from our developers is the desire for
faster startup time. To improve startup time
for WebAssembly modules, we're starting to
roll out something we're calling implicit caching. To recap, when a site
loads a WebAssembly module, it first goes into
the Liftoff baseline compiler to start executing immediately. It then gets further optimized off the main thread by the TurboFan optimizing compiler, and the result is hot-swapped in when ready. Now, with implicit
caching, we also cache that optimized
WebAssembly module directly in the HTTP cache. Then, after the user leaves
the page and comes back, we load that optimized
module directly from the cache, resulting in
immediate top tier performance. As the name suggests, implicit
caching happens automatically. But there are two tips worth
knowing and keeping in mind. The first is that code caching for WebAssembly works off of the streaming APIs, so make sure to always use WebAssembly.compileStreaming or WebAssembly.instantiateStreaming. The second is just to make sure that you're being cache friendly. Chrome keys the cache off the URL of the WebAssembly module, so if that URL changes on each load, you won't see any of the benefits.
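Putting both tips together, a minimal sketch of a cache-friendly streaming load might look like this (the URL and the exported main function are placeholders):

    // Keep the module URL stable across visits so the implicit cache stays warm.
    const wasmUrl = '/app/module.wasm'; // placeholder; avoid cache-busting query params
    // The streaming APIs also require the server to send
    // Content-Type: application/wasm for the module.
    const { instance } = await WebAssembly.instantiateStreaming(fetch(wasmUrl), {});
    instance.exports.main(); // assumes the module exports a main() function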
In addition to new features like implicit caching, we're also always
making improvements to our WebAssembly engine. Here you can see how
commit by commit, we've cut startup
time by almost half since just the start
of this last year. OK, so now that we've covered
some of the improvements that have been made in
Chrome, I want to get into some of the
actual new language features of WebAssembly. The first feature that
I want to talk about is WebAssembly threads. Threads are a key part
of practically all CPUs, and utilizing them
fully and effectively has been one of the great
challenges for the web until now. WebAssembly threads
work by relying on three specific things-- Web Workers, SharedArrayBuffer,
and atomic operations. Web Workers allow WebAssembly
to run on different CPU cores. Then SharedArrayBuffer
allows WebAssembly to operate on the same piece of memory. Lastly, atomic operations, specifically atomic.wait and atomic.notify, let you synchronize your WebAssembly so that things happen in the right order.
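To make those three building blocks concrete, here's a minimal, hypothetical sketch at the JavaScript level; a WebAssembly module built with threads uses the equivalent atomic instructions internally:

    // main.js
    // A shared WebAssembly memory is backed by a SharedArrayBuffer.
    const memory = new WebAssembly.Memory({ initial: 1, maximum: 1, shared: true });
    const worker = new Worker('worker.js'); // another CPU core via Web Workers
    worker.postMessage(memory);             // both threads now see the same memory
    const shared = new Int32Array(memory.buffer);
    Atomics.store(shared, 0, 42); // write a value...
    Atomics.notify(shared, 0);    // ...and wake anyone waiting on index 0

    // worker.js
    onmessage = ({ data: memory }) => {
      const shared = new Int32Array(memory.buffer);
      Atomics.wait(shared, 0, 0); // block while the value at index 0 is still 0
      console.log(shared[0]);     // 42, read in the right order
    };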
Google Earth adopted WebAssembly threads with great success. They saw their frame
rate almost double and their number of dropped
frames cut by more than half. Soundation, a music
editing studio, similarly adopted
threads to enable highly efficient parallelization. As they increased their
number of threads, they saw their performance
more than triple. One application that I'm
particularly excited to share, which is coming to the web through WebAssembly threads, is VLC. They were able to originally
compile their code base to baseline WebAssembly. But without threads,
they weren't able to achieve anything
close to the performance that they needed. Now thanks to threads, they
have a working prototype running directly in Chrome. So going back to our
analogy from earlier, if baseline WebAssembly is
like running along a train track, WebAssembly with threads
is like an actual train. You're achieving speeds that
were previously impossible. Threads have been
available in desktop Chrome since version 74. In Chrome for Android and in Firefox, threads are implemented, but not enabled
by default. We're actively working with
other browser vendors and the WebAssembly
community to make threads available in more places. Since threads are not supported everywhere, it's critical to use
feature detection before relying on their
presence, which Ingvar will now show you how to do. INGVAR STEPANYAN:
Thank you, Thomas. Unfortunately,
WebAssembly does not have built-in feature detection yet, although it's being actively worked on. For now, we created a
JavaScript library instead that you can use to detect
WebAssembly features supported by your browser. This allows you to build several
versions of your WebAssembly module, for different
feature sets, just like you would
for modern JavaScript bundles and dynamically choose
the ones that your browser can handle. For example, you can use the threads function in order to detect whether the browser supports threads in WebAssembly. Then you can use dynamic import to load either version of your WebAssembly module: the JavaScript bindings that make use of threads for optimizations, or the regular one for older browsers.
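Sketched out, the pattern looks roughly like this, assuming the wasm-feature-detect library; the module paths and the init entry point are hypothetical:

    import { threads } from 'wasm-feature-detect';

    // Load the bindings that match this browser's capabilities.
    const bindings = (await threads())
      ? await import('./bindings.threads.js') // hypothetical build made with -pthread
      : await import('./bindings.single.js'); // hypothetical single-threaded build
    await bindings.init(); // hypothetical entry point exposed by both builds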
So how do you build a version with threads in the first place? If you're using Emscripten, you need to pass the argument -pthread during compilation, like you would to regular, native C compilers. And it will automatically
generate the WebAssembly module and the JavaScript necessary
for creating, managing, and communicating with the
Web Workers under the hood. If you write in C, Emscripten allows you to use common
POSIX thread APIs, just like those available
on native Unix platforms. For example, you can
use pthread_create with a handler function and arguments in order to start a new thread, and later call pthread_join in order to wait for it to finish and read the results back. If you write in C++,
the good news is that Emscripten's implementation of the standard thread APIs, just like the one on Unix, makes use of POSIX threads under the hood. And other high-level APIs, such as std::async, make use of std::thread at the C++ standard library level. So they all just work. This means that, for example, you can use std::thread with closures in your C++ code, and it will lower to the same pthread calls and be handled by Emscripten. Similarly, you can use
std::async APIs to spawn futures, which are quite
similar to JavaScript promises, but allow you to spawn
tasks on your threads. In Rust, the threading story is not yet as fleshed out, as you need to manually create Web Workers, send them the WebAssembly module and the memory that you want to share, as well as rebuild the standard library with thread support. However, after jumping
through a few hoops, you are able to even use popular
multi-threading libraries, like Rayon, as in this demo by the Rust WebAssembly team. Here, they wrote a ray tracer that splits rendering into several threads and compiled it to WebAssembly. You can see how,
with a single thread, it takes 1.7 seconds to
render the entire image. But if you split the work into, say, four threads, it takes only 0.8
seconds, making it more than two times faster. Another performance feature
that is making its way into WebAssembly is SIMD. And I'd like to invite Thomas
back, to tell us what it is and how it can help us. THOMAS NATTESTAD: Thank
you so much, Ingvar. So, SIMD stands for Single
Instruction Multiple Data. And while this may not be a term
that most web developers are familiar with, it's
an absolutely key part of modern CPU architectures. So to explain SIMD, let's
take this simple example of adding two arrays
together into a third array, using a simple for loop. Without SIMD, the CPU
goes through this loop and adds the different
elements together, one by one, taking four full steps. Now, with SIMD, the CPU is able
to vectorize these elements and then take just a single CPU
operation to add them together.
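Written out in JavaScript for readability, the loop in question is just this (the actual vectorization happens in compiled code such as WebAssembly, not in JavaScript source):

    const a = [1, 2, 3, 4];
    const b = [5, 6, 7, 8];
    const c = new Array(4);
    for (let i = 0; i < 4; i++) {
      c[i] = a[i] + b[i]; // scalar: four separate additions, one per element
    }
    // With SIMD, a single vector instruction performs all four additions at once.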
This may seem simple, but it can have dramatic impacts on performance. To show the power
that SIMD can deliver, I want to show off some of the
work done by our colleagues at Google Research. They've developed several
real-time ML models that can do everything from
letting you try on fake glasses or puppet masks, doing dynamic
background removal, and much more. One of the coolest demos is
this hand tracking system. And here, you can really see
the difference that SIMD makes. Without SIMD, you're only
getting about three frames per second, while
with SIMD, you've got a much smoother
15 frames per second, which makes all the difference. You can visit this link
to check these out for yourself or come by the
sandbox to play with them. The Google research team looked
at a bunch of their models and found that, in general,
SIMD offered a 3x improvement on overall speed. The next example that
I want to show off is OpenCV and some of the
work done by our friends at Intel and UC Irvine. OpenCV is an extremely
popular image analysis library that has tons of performance
dependent functionality. OpenCV can be compiled
to WebAssembly and run directly in the browser. It can be used for doing
things, like card reading, replacing real
emotions with emojis, and for all the Harry
Potter fans out there, you can now have your very own
web-powered invisibility cloak. You can visit this
link to try them out. Or again, come by the sandbox
to check and see them there. This work has actually been
fully upstreamed into OpenCV. And they even have
a tutorial on how to set up OpenCV with Emscripten, so that you can all play
with this yourself, at home. And all of this functionality
can take advantage of threads and SIMD to dramatically
improve performance. Here we can see the
visual difference of first adding SIMD and
then SIMD plus threads. And our benchmarking backs
up this visually noticeable difference. When using both threads
and SIMD together, common tasks in OpenCV can
be improved by around 15x. And some of the benchmarks show
even more dramatic improvements from threads and SIMD. For the OpenCV kernel
performance test, using threads gives
you a 3.5x improvement. And using SIMD gives you an even
more impressive 9x improvement, just by itself. And then when you
take these together, it results in an
overall 30x improvement to this performance test,
which is truly staggering. So coming back to
our train analogy, because who doesn't love trains,
if WebAssembly threads is like an old-style train, using
threads and SIMD together is like a modern bullet train. So to show you how to actually
take advantage of this in code, I'd like to hand
it back to Ingvar. INGVAR STEPANYAN:
Thanks, Thomas. To build code with SIMD in Emscripten, you need to pass a special parameter, -m, which tells the underlying Clang compiler to enable a specific feature, followed by simd128 (that is, -msimd128), which is the feature name for the currently supported 128-bit SIMD operations in WebAssembly. In Rust, you need to pass the
same feature name via the -C target-feature compiler flag (-C target-feature=+simd128). The easiest way to do this on a real project using Cargo or wasm-pack is currently setting the environment variable RUSTFLAGS during compilation. Now that we've covered
how to compile our code, let's see what it takes to
actually use SIMD in our code. The good news is that, in the simplest case, the answer is nothing. That is, unlike with threads, SIMD is something the compiler can often take advantage of, and take care of, without you having to modify any code at all. This compiler feature is called auto-vectorization. It detects loops that perform the same mathematical operations on array items independently. For example, let's take a
look at this simple code in C, an equivalent one in C++, or the same one in Rust. Such a loop operates
on an array of numbers. Check. It performs
arithmetic operations. Also, check. And it clearly operates on each item independently. Also, check. So the compiler should be able to make use of SIMD to process several elements at once, rather than handling them one by one, and make it faster. Let's see if it does. First, let's compile this code, in any of the source languages, without SIMD enabled, and take a look at the resulting WebAssembly. We can see that our function gets compiled to a loop that loads an item from an array, multiplies it by 10, and stores the result back. No surprises here. Now, let's recompile with SIMD enabled. We can see that, aside
from our regular boilerplate, there is now another
loop that loads four items out of the array at once, multiplies them by four instances of the number 10, and stores the results back, each in just one operation. While this is a toy example and not a real-world benchmark, it's interesting to see how such an implicit optimization can help achieve a consistent three-times increase in the performance of the generated code. In some situations,
however, you don't want to leave it to chance whether your code gets optimized this way, or your data has a specific layout, or you just want more control over which features are used. This is where intrinsics come in handy. Intrinsics are
special helpers that look like regular
functions but correspond to specific instructions
on the target. For SIMD in Emscripten, they live in the wasm_simd128.h header and contain all the basic operations for creating, loading, storing, and operating on the supported SIMD vector types. In Rust, the easiest way to use them is via the external packed_simd crate, which is intended to be a prototype for a future standard library API. One important thing to keep
in mind is that SIMD is still experimental and available only in Chrome behind a flag. So, just like with threads, you need to make a separate build that makes use of SIMD, and then use the feature-detection library to load it only if it's supported.
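As a hypothetical sketch, with placeholder file names:

    import { simd } from 'wasm-feature-detect';

    const wasmUrl = (await simd())
      ? '/app/kernels.simd.wasm' // hypothetical build compiled with -msimd128
      : '/app/kernels.wasm';     // hypothetical portable build without SIMD
    const { instance } = await WebAssembly.instantiateStreaming(fetch(wasmUrl), {});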
Now that we've covered new WebAssembly features, we've got some exciting tooling improvements to share with you, too. First of all, earlier
this year, LLVM, the compiler infrastructure behind projects such as Clang, Rust, and lots of others, stabilized and finished support for the WebAssembly target. This includes both compilation of separate source files into WebAssembly object files, as well as linking them together into the final module. It's not very usable on its own, though. For example, while it allows
you to compile separate C/C++ files into WebAssembly, it
doesn't include any standard library. And it expects you
to bring your own. However, it does provide a solid
foundation for other compilers to build on. Let's take a look at Emscripten. Before this, Emscripten had
to maintain a complex, custom compilation pipeline and a
fork of LLVM, called fastcomp, in order to parse the intermediate representation from Clang, compile it to asm.js, and, when WebAssembly came along, also convert it to WebAssembly. Having to work around LLVM this way led to various incompatibilities, such as difficulties during upgrades and suboptimal compilation performance. Now that the
WebAssembly support has been properly
integrated into the LLVM, Emscripten can leverage it
to simplify the compilation process and focus on
providing a great development experience, custom features,
and a standard library, while all core work on features and optimizations can continue to be developed upstream. As an example of
such improvements, switching to the native backend allowed Emscripten to significantly improve linking times, at a small extra cost to initial compilation. This particularly helps during incremental development, where you usually modify and recompile only one or two files at a time, and all you need is
a fast linking step. Some projects have seen as
much as a seven-times improvement in recompilation times in such cases. However, there were some
compile-time features, unique to Emscripten,
that were previously handled by the earlier
mentioned fork of LLVM, and could be lost in the transition. One such feature
is Asyncify. Normally, when calling from
JavaScript to WebAssembly, and then from WebAssembly
to some Web APIs, you expect to read the result
back, continue execution, and eventually
return to JavaScript. However, many long-running and expensive Web APIs tend to spawn asynchronous tasks to avoid blocking the main thread. This includes timers, the Fetch API, the Web Crypto API, and lots of others. Because WebAssembly does not
have a notion of an event loop, promises, or asynchronous tasks, to WebAssembly it looks like the external API has finished execution as soon as it returns. So it continues running user code immediately, while the async task is still running in the background, with no handlers attached.
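Here's a sketch of the problem, with a hypothetical read_data import:

    // Without Asyncify: the import has to return to WebAssembly immediately,
    // even though the real work has only just been kicked off.
    const imports = {
      env: {
        read_data: () => {
          fetch('/data.bin'); // spawns an async task in the background...
          return 0;           // ...but a value must be returned right away
        },
      },
    };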
This is not what we normally want. We want to not only be able
to start an asynchronous task, but also wait for it to
finish, read the results back, and continue execution afterwards. This is where Asyncify comes in. I won't go too much into
implementation details here. But what it does is compile
the WebAssembly module in such a way that you can
suspend execution, remember the state,
and later, resume from the exact same point,
when an asynchronous task has finished its execution. This is quite similar
to await, in JavaScript, but applied to native
functions and with no changes to your own code. In order to use it
from Emscripten, you need to pass a
special parameter, -s ASYNCIFY, and specify which imports should be treated as asynchronous. Then, in your code, you can use regular function imports and invoke them like any other functions, while Asyncify does its magic under the hood. The great news is that, with
the transition to the upstream LLVM backend, this feature has not gone away but was extracted as a separate transform, and it can now be used from any language that compiles to WebAssembly, not just C/C++. For example, you
can simply invoke asynchronous
JavaScript functions from Rust, which is
particularly helpful for polyfilling standard synchronous system APIs available on other platforms. Since you are not using Emscripten in this case, after you have compiled your module, you run it through Binaryen's wasm-opt tool with the Asyncify pass instead, and it will add all the necessary magic for suspending and resuming execution. Then, you'd need some glue on
the JavaScript side, as well. We have published a library for this: asyncify-wasm. It mimics the regular WebAssembly API, but allows instantiating modules with asynchronous imports and exports. To use it, first import it from the asyncify-wasm module. Then, you can use the regular instantiation APIs, but pass asynchronous imports in addition to the regular ones. Since your WebAssembly module might now invoke asynchronous APIs at arbitrary points, all the exports become asynchronous, too. So you need to prefix calls to your exports with await, and you're good to go.
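As a rough sketch, with a hypothetical module and import:

    import * as Asyncify from 'asyncify-wasm';

    const { instance } = await Asyncify.instantiateStreaming(
      fetch('module.wasm'), // placeholder module processed with the Asyncify transform
      {
        env: {
          // An async import: to the WebAssembly code, this looks like an
          // ordinary synchronous function call.
          get_number: async () => {
            const response = await fetch('/number.txt'); // hypothetical endpoint
            return Number(await response.text());
          },
        },
      }
    );

    // Exports become asynchronous too, so prefix calls with await.
    await instance.exports.main();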
One particularly interesting use case for Asyncify, aside from external APIs, is in Emscripten itself. Emscripten allows you to mark rarely used parts of your code and split them out into a separate WebAssembly module during compilation, lazy-loading them only when they're invoked. This allows us to keep
your initial bundle small, without any breakage to your own
code and with minimal changes. To use it, you need to
call a special function, emscripten_lazy_load_code. During compilation, it will
extract any following code into a separate
WebAssembly module. Then, at runtime, when (or if) that code is actually reached during execution, Emscripten will use Asyncify to dynamically load the missing pieces and continue as if there was never a split in the first place. These are all great features, and it's amazing to see
how WebAssembly is growing over time. However, with this growth in features, the surface area for potential bugs has expanded as well. When things go wrong,
and we all know, they often do, you want
to be able to track where the problem occurred,
reproduce it step by step, track the inputs that led to
the issue in the first place, and so on. You want to be able to
debug the application. Until recently,
you had two options for debugging WebAssembly. First, you could get raw stack traces, as well as step over individual instructions in the WebAssembly text format. This helps somewhat with
debugging of small isolated functions. But it's not very practical
for larger apps, where the mapping between
the disassembled source and your original
sources is less obvious. To work around this
problem, Emscripten and DevTools initially adapted the existing source maps format, which was designed for languages that compile to JavaScript, for WebAssembly. This allowed us to
map binary offsets in the compiled module to locations in the original source files. However, this format was
designed for text-based languages with a clear mapping to JavaScript concepts and values, not for binary formats like WebAssembly, with linear memory, arbitrary source languages, and arbitrary type systems. This made the
integration hacky, limited, and not
widely supported outside of Emscripten. On the other hand, many
native languages already have a common
debugging format that contains all the necessary
information for the debugger to resolve locations, variable
names, type layouts, and much more. This format is called DWARF. While there's still some
WebAssembly-specific features that need to be added for full compatibility, compilers like Clang and Rust already support emitting DWARF information into WebAssembly modules, which allows us to start using it directly in DevTools. As a first step, we went ahead
and implemented native support for this format. So you can start debugging the WebAssembly modules produced by any of these compilers, without having to resort to the disassembled format or to custom scripts for source map generation. This integration currently only covers
stepping in and over the code in any of these languages, setting breakpoints, and resolving stack traces. There's still much
more we can do though, such as
pretty-printing types or even evaluating expressions
in the source languages. We are actively
working on bringing this and many other improvements
to the WebAssembly experience. So please stay tuned
for future updates. And thank you for your time today. [APPLAUSE] [MUSIC PLAYING]