THOMAS NATTESTAD: Hi, everyone. My name is Thomas, and together with my colleague Ingvar here, we're going to show you how WebAssembly can speed up your computationally intensive workloads by more than 10x, and how modern WebAssembly tooling makes it easier to take advantage of. We'll start by reminding everyone what WebAssembly is and showing some of the improvements we've been making to Chrome's implementation. Then we'll get into some of the new language features that are starting to ship as part of WebAssembly. Finally, we'll close out by covering some of the tooling updates that have been coming as well. So let's start by reminding everyone what WebAssembly actually is. WebAssembly is a new language for the web that is designed as a compilation target to offer maximum, reliable performance. It's important to remember, though, that WebAssembly is in no way meant to replace JavaScript. Rather, it's meant to augment JavaScript by handling the things it was never designed to do. So let's look at some of the advantages of WebAssembly and why you might want to use it. First, because WebAssembly offers strong type guarantees, it gives you more consistent and reliable performance than JavaScript. Then, with additional features like threads and SIMD, which we'll get into later, you can achieve speeds that go beyond what is possible with JavaScript. When comparing the baseline performance of WebAssembly to JavaScript, I find this metaphor, which my colleague Surma came up with, really useful. JavaScript is like running along a tightrope. It's possible to go fast, but it requires a lot of skill, and it's possible to fall off the fast path. Baseline WebAssembly, by contrast, is more like running along a train track: you don't have to be as careful in order to go fast. Another advantage of WebAssembly is its amazing portability.
Because you can compile from other languages, you can bring to the web not only your own code bases and libraries, but also the incredible wealth of open source libraries written in languages like C++. Lastly, and potentially most exciting to many of you out there, WebAssembly gives you more flexibility when writing for the web: specifically, the ability to write in other languages. Since the web's inception, JavaScript has been the only fully supported option, and now, through WebAssembly, you get more choice. Most exciting, though, is the fact that WebAssembly is now shipping in all major browsers, making it the first new language to ship in every major browser since JavaScript was created more than 20 years ago. So now that we're all reminded of what WebAssembly actually is, I want to cover some of the improvements we've been making directly in Chrome. One of the biggest requests we've heard from developers is faster startup time. To improve startup time for WebAssembly modules, we're starting to roll out something we're calling implicit caching. To recap, when a site loads a WebAssembly module, it first goes into the Liftoff baseline compiler so that it can start executing immediately. It is then further optimized off the main thread by the TurboFan optimizing compiler, and the result is hot-swapped in when ready. Now, with implicit caching, we also store that optimized WebAssembly module directly in the HTTP cache. When the user leaves the page and comes back, we load the optimized module directly from the cache, resulting in immediate top-tier performance. As the name suggests, implicit caching happens automatically, but there are two tips worth keeping in mind. The first is that code caching in WebAssembly works off the streaming APIs, so make sure you always use WebAssembly.compileStreaming or WebAssembly.instantiateStreaming. The second is to make sure that you're being cache friendly.
The cache is keyed on the URL of the WebAssembly module, so if that URL changes on each load, you won't see any of the benefits. In addition to new features like implicit caching, we're also constantly improving our WebAssembly engine. Here you can see how, commit by commit, we've cut startup time almost in half since the start of last year. OK, now that we've covered some of the improvements in Chrome, I want to get into some of the new language features of WebAssembly. The first feature I want to talk about is WebAssembly threads. Threads are a key part of practically all CPUs, and using them fully and effectively has, until now, been one of the great challenges for the web. WebAssembly threads rely on three specific things: Web Workers, SharedArrayBuffer, and atomic operations. Web Workers allow WebAssembly to run on different CPU cores. SharedArrayBuffer allows WebAssembly on those different threads to operate on the same piece of memory. Lastly, atomic operations, such as atomic.wait and atomic.notify, let you synchronize your WebAssembly so that things happen in the right order. Google Earth adopted WebAssembly threads with great success: they saw their frame rate almost double and their number of dropped frames cut by more than half. Soundation, a music editing studio, similarly adopted threads to enable highly efficient parallelization. As they increased their number of threads, they saw their performance more than triple. One application I'm particularly excited to share, which is coming to the web through WebAssembly threads, is VLC. They were originally able to compile their code base to baseline WebAssembly, but without threads, they couldn't achieve anything close to the performance they needed. Now, thanks to threads, they have a working prototype running directly in Chrome.
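The three building blocks map fairly directly onto ordinary C++. The sketch below is an illustration, not code from the talk: it assumes an Emscripten build with -pthread, where each std::thread is backed by a Web Worker, the shared counter lives in WebAssembly memory backed by a SharedArrayBuffer, and std::atomic operations compile to WebAssembly atomic instructions. The function name count_with_threads is hypothetical.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Several threads increment one shared counter.
// Assumed Emscripten -pthread build: each std::thread runs in a Web Worker,
// `counter` sits in shared linear memory (a SharedArrayBuffer), and
// fetch_add lowers to a WebAssembly atomic read-modify-write instruction.
int count_with_threads(int num_threads, int increments_per_thread) {
  std::atomic<int> counter{0};
  std::vector<std::thread> workers;
  for (int t = 0; t < num_threads; ++t) {
    workers.emplace_back([&counter, increments_per_thread] {
      for (int i = 0; i < increments_per_thread; ++i) {
        // Atomic increment: safe even when threads run truly in parallel.
        counter.fetch_add(1, std::memory_order_relaxed);
      }
    });
  }
  for (auto& w : workers) w.join();  // wait until every worker is done
  return counter.load();
}
```

For example, count_with_threads(4, 10000) returns 40000. With a plain int instead of std::atomic<int>, the increments would race and the result would be unreliable.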
So going back to our analogy from earlier, if baseline WebAssembly is like running along a train track, WebAssembly with threads is like an actual train: you're achieving speeds that were previously impossible. Threads have been available in Chrome desktop since version 74. In Chrome for Android and in Firefox, threads are implemented but not enabled by default. We're actively working with other browser vendors and the WebAssembly community to make threads available in more places. Since threads are not supported everywhere, it's critical to use feature detection before relying on their presence, which Ingvar will now show you how to do. INGVAR STEPANYAN: Thank you, Thomas. Unfortunately, WebAssembly does not have built-in feature detection yet, although it's being actively worked on. For now, we've created a JavaScript library, wasm-feature-detect, that you can use to detect which WebAssembly features your browser supports. This allows you to build several versions of your WebAssembly module for different feature sets, just like you would with modern JavaScript bundles, and dynamically choose the one that the browser can handle. For example, you can use the threads() function to detect whether the browser supports threads in WebAssembly. Then you can use a dynamic import() to load either the version of your WebAssembly module and JavaScript bindings that uses threads for optimization, or the regular one for older browsers. How do you build a version with threads in the first place? If you're using Emscripten, you need to pass the -pthread argument during compilation, just like you would with regular native C compilers. Emscripten will then automatically generate the WebAssembly module and the JavaScript necessary for creating, managing, and communicating with the Web Workers under the hood. If you're writing in C, Emscripten allows you to use the common POSIX thread APIs, just like those available on native Unix platforms.
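A minimal sketch of the pthread_create/pthread_join pattern, written in C++ but using only the POSIX C API. It assumes an Emscripten build with -pthread (or any native POSIX system); SumTask, sum_worker, and sum_on_thread are hypothetical names used for illustration. The result is handed back through the struct passed to the thread, which the caller reads only after pthread_join returns.

```cpp
#include <pthread.h>

// Hypothetical payload handed to the worker thread and filled in by it.
struct SumTask {
  const int* data;
  int length;
  long long result;  // written by the worker, read after pthread_join
};

// Thread entry point with the void* (*)(void*) signature pthread_create expects.
static void* sum_worker(void* arg) {
  SumTask* task = static_cast<SumTask*>(arg);
  long long total = 0;
  for (int i = 0; i < task->length; ++i) total += task->data[i];
  task->result = total;
  return nullptr;
}

// Starts one thread, waits for it to finish, then reads the result back.
// Returns false if the thread could not be created or joined.
bool sum_on_thread(const int* data, int length, long long* out) {
  SumTask task{data, length, 0};
  pthread_t thread;
  if (pthread_create(&thread, nullptr, sum_worker, &task) != 0) return false;
  if (pthread_join(thread, nullptr) != 0) return false;
  *out = task.result;  // safe: the worker has finished writing
  return true;
}
```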
For example, you can call pthread_create with a handler function and its arguments to start a new thread, and then call pthread_join to wait for it to finish and read the results back. If you write in C++, the good news is that Emscripten implements the standard thread APIs, which, just like on Unix, use POSIX threads under the hood. Higher-level APIs, such as std::async, are built on std::thread within the C++ standard library, so they all just work. This means that, for example, you can use std::thread with closures in your C++ code, and it will lower to the same pthread calls handled by Emscripten. Similarly, you can use the std::async APIs to spawn futures, which are quite similar to JavaScript promises but let you run tasks on your threads. In Rust, this story is not yet as fleshed out: you may need to manually create Web Workers, send them the WebAssembly module and the memory you want to share, and rebuild the standard library with thread support. However, after jumping through a few hoops, you can even use popular multithreading libraries like Rayon, as in this demo by the Rust WebAssembly team. Here, they took a ray tracer, split the rendering across several threads, and compiled it to WebAssembly. You can see that with a single thread, it takes 1.7 seconds to render the entire image, but if you split the work across, say, four threads, it takes only 0.8 seconds, making it more than twice as fast. Another performance feature that is making its way into WebAssembly is SIMD, and I'd like to invite Thomas back to tell us what it is and how it can help us. THOMAS NATTESTAD: Thank you so much, Ingvar. So, SIMD stands for Single Instruction, Multiple Data. While this may not be a term that most web developers are familiar with, it's an absolutely key part of modern CPU architectures.
So to explain SIMD, let's take this simple example of adding two arrays together into a third array using a simple for loop. Without SIMD, the CPU goes through the loop and adds the elements together one by one, taking four full steps for our four elements. With SIMD, the CPU can load these elements into vectors and add them all together in a single CPU operation. This may seem simple, but it can have a dramatic impact on performance. To show the power that SIMD can deliver, I want to show off some of the work done by our colleagues at Google Research. They've developed several real-time ML models that can do everything from letting you try on fake glasses or puppet masks to dynamic background removal, and much more. One of the coolest demos is this hand-tracking system, and here you can really see the difference that SIMD makes. Without SIMD, you only get about three frames per second, while with SIMD, you get a much smoother 15 frames per second, which makes all the difference. You can visit this link to check these out for yourself, or come by the sandbox to play with them. The Google Research team looked at a number of their models and found that, in general, SIMD offered a 3x improvement in overall speed. The next example I want to show off is OpenCV and some of the work done by our friends at Intel and UC Irvine. OpenCV is an extremely popular image analysis library with tons of performance-sensitive functionality. OpenCV can be compiled to WebAssembly and run directly in the browser. It can be used for things like card reading, replacing real emotions with emojis, and, for all the Harry Potter fans out there, your very own web-powered invisibility cloak. You can visit this link to try them out, or again, come by the sandbox to see them there. This work has actually been fully upstreamed into OpenCV.
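To make the "one operation instead of four" idea concrete, here is a sketch of the array-addition example using the GCC/Clang vector extension (vector_size), which is a compiler extension rather than standard C++. The names float4, add_scalar, and add_vectorized are hypothetical. When Clang targets WebAssembly with SIMD enabled, the four-lane addition can map to a single f32x4.add instruction; on other targets it maps to that platform's SIMD instructions, or is split into scalar additions if none are available.

```cpp
#include <cstring>

// GCC/Clang extension: a 128-bit vector holding four 32-bit floats.
typedef float float4 __attribute__((vector_size(16)));

// Scalar version: one addition per element, so four steps for four elements.
void add_scalar(const float* a, const float* b, float* out, int n) {
  for (int i = 0; i < n; ++i) out[i] = a[i] + b[i];
}

// SIMD-style version: four elements per iteration with one vector addition,
// then any leftover elements one by one.
void add_vectorized(const float* a, const float* b, float* out, int n) {
  int i = 0;
  for (; i + 4 <= n; i += 4) {
    float4 va, vb;
    std::memcpy(&va, a + i, sizeof va);  // load 4 floats (no alignment needed)
    std::memcpy(&vb, b + i, sizeof vb);
    float4 vsum = va + vb;               // one add across all four lanes
    std::memcpy(out + i, &vsum, sizeof vsum);
  }
  for (; i < n; ++i) out[i] = a[i] + b[i];  // tail elements
}
```

Both functions compute the same result; the difference is only in how many elements each addition handles.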
The OpenCV project even has a tutorial on how to set up OpenCV with Emscripten, so that you can all play with this yourself at home. All of this functionality can take advantage of threads and SIMD to dramatically improve performance. Here we can see the visual difference of first adding SIMD, and then SIMD plus threads, and our benchmarking backs up this visually noticeable difference. When using threads and SIMD together, common tasks in OpenCV can be sped up by around 15x. Some of the benchmarks show even more dramatic improvements. For the OpenCV kernel performance test, using threads gives you a 3.5x improvement, and using SIMD gives you an even more impressive 9x improvement by itself. Taken together, they result in an overall 30x improvement on this performance test, which is truly staggering. So coming back to our train analogy, because who doesn't love trains: if WebAssembly threads are like an old-style train, using threads and SIMD together is like a modern bullet train. To show you how to actually take advantage of this in code, I'd like to hand it back to Ingvar. INGVAR STEPANYAN: Thanks, Thomas. To build code with SIMD in Emscripten, you need to pass the -msimd128 flag: -m tells the underlying Clang compiler to enable a specific feature, and simd128 is the name of the feature for the currently supported 128-bit SIMD operations in WebAssembly. In Rust, you need to pass the same feature name via the -C target-feature=+simd128 compiler flag. The easiest way to do this on a real project using cargo or wasm-pack is currently through the RUSTFLAGS environment variable passed during compilation. Now that we've covered how to compile our code, let's see what it takes to actually use SIMD in it. The good news is that, in the simplest case, the answer is nothing.
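A sketch of what a SIMD-enabled build variant might contain, assuming an Emscripten build with -O3 -msimd128. Clang predefines __wasm_simd128__ when SIMD is enabled for the WebAssembly target, so the same source can produce both a SIMD build and a baseline build. multiply_by_10 is the kind of simple, independent loop that auto-vectorization targets, while multiply_by_10_explicit spells out the same work with intrinsics from wasm_simd128.h. The intrinsic names follow current Emscripten releases and may differ from the experimental 2019 header; all function names here are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#ifdef __wasm_simd128__
#include <wasm_simd128.h>
#endif

// True only in a build compiled with WebAssembly SIMD enabled, e.g.
// (assumed command):  emcc -O3 -msimd128 simd.cpp -o simd.js
bool built_with_wasm_simd() {
#ifdef __wasm_simd128__
  return true;
#else
  return false;
#endif
}

// Plain loop: each element is processed independently, so with -O3 and
// -msimd128 the compiler may auto-vectorize it without source changes.
void multiply_by_10(int32_t* values, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) values[i] *= 10;
}

// Explicit variant using <wasm_simd128.h> intrinsics. Without SIMD enabled,
// only the scalar loop at the end is compiled in.
void multiply_by_10_explicit(int32_t* values, std::size_t n) {
  std::size_t i = 0;
#ifdef __wasm_simd128__
  const v128_t ten = wasm_i32x4_splat(10);  // four lanes, each holding 10
  for (; i + 4 <= n; i += 4) {
    v128_t chunk = wasm_v128_load(values + i);                // load 4 ints
    wasm_v128_store(values + i, wasm_i32x4_mul(chunk, ten));  // 4 multiplies at once
  }
#endif
  for (; i < n; ++i) values[i] *= 10;  // leftover elements (or all of them)
}
```

Compiled natively, or for WebAssembly without -msimd128, only the scalar loops are included, so both functions produce the same results in every build.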
Unlike threads, SIMD is something the compiler can often take advantage of on its own, without you having to modify any code at all. This compiler feature is called auto-vectorization: it detects loops that perform the same mathematical operations on array items independently. For example, let's take a look at this simple code in C, the equivalent in C++, and the same loop in Rust. The loop operates on an array of numbers. Check. It performs arithmetic operations. Also check. And it clearly processes each item independently. Also check. So the compiler should be able to use SIMD to process several elements at once, instead of handling them one by one, and make the code faster. Let's see if it does. First, let's compile this code, in any of the source languages, without SIMD enabled and take a look at the disassembled WebAssembly. We can see that our function gets compiled to a loop that loads an item from the array, multiplies it by 10, and stores the result back. No surprises here. Now, let's compile it again with SIMD enabled. We can see that, aside from some regular boilerplate, there is now another loop that loads four items out of the array, multiplies them by four instances of the number 10, and stores the results back, each in just one operation. While this is a toy example and not a real-world benchmark, it's interesting to see how such implicit optimization achieves a consistent threefold speedup of the generated code. In some situations, however, you don't want to leave it to chance whether your code gets optimized this way, or your data has a specific layout, or you just want more control over which instructions are used. This is where intrinsics come in handy. Intrinsics are special helpers that look like regular functions but correspond to specific instructions on the target.
For SIMD in Emscripten, they live in the wasm_simd128.h header, which contains all the basic operations for creating, loading, storing, and operating on the supported SIMD vector types. In Rust, the easiest way to use them is the external packed_simd crate, which is intended to be a prototype for a future standard library API. One important thing to keep in mind is that SIMD is still experimental and currently available only in Chrome behind a flag. So, just like with threads, you need to make a separate build that uses SIMD, and then use a feature detection library to load it only if it's supported. Now that we've covered new WebAssembly features, we've got some exciting tooling improvements to share with you, too. First of all, earlier this year LLVM, the compiler infrastructure behind projects such as Clang, Rust, and lots of others, stabilized its support for the WebAssembly target. This includes both compiling separate source files into WebAssembly object files and linking them together into the final module. It's not very usable on its own. For example, while it allows you to compile separate C/C++ files into WebAssembly, it doesn't include any standard library, and it expects you to bring your own. However, it does provide a solid foundation for other compilers to build on. Let's take a look at Emscripten. Before this, Emscripten had to maintain a complex, custom compilation pipeline and a fork of LLVM, called fastcomp, in order to parse the intermediate representation from Clang, compile it to asm.js, and, when WebAssembly came along, also convert it to WebAssembly. Working around LLVM this way led to various issues, such as incompatibilities, difficulties during upgrades, and suboptimal compilation performance.
Now that WebAssembly support has been properly integrated into LLVM, Emscripten can leverage it to simplify the compilation process and focus on providing a great developer experience, custom features, and a standard library, while core work on features and optimizations continues upstream. For example, switching to the upstream backend allowed Emscripten to significantly improve linking times, at a small extra cost to initial compilation. This particularly helps during incremental development, where you usually modify and recompile only one or two files at a time, and all you need is a fast linking step. Some projects have seen as much as a seven-times improvement in recompilation times in such cases. However, there were some compile-time features unique to Emscripten that were previously handled by the fork of LLVM mentioned earlier, and they could have been lost in the transition. One such feature is Asyncify. Normally, when calling from JavaScript into WebAssembly, and then from WebAssembly into some Web API, you expect to read the result back, continue execution, and eventually return to JavaScript. However, many long-running and expensive Web APIs spawn asynchronous tasks to avoid blocking the main thread. This includes timers, the Fetch API, the Web Crypto API, and lots of others. Because WebAssembly has no notion of an event loop, promises, or asynchronous tasks, it treats the external API call as finished as soon as that call returns. So it continues running the user's code immediately, while the async task is still running in the background with no handlers attached. This is not what we normally want. We want to be able not only to start an asynchronous task, but also to wait for it to finish, read the results back, and continue execution afterwards. This is where Asyncify comes in. I won't go too much into implementation details here.
But what it does is compile the WebAssembly module in such a way that you can suspend execution, remember the state, and later resume from the exact same point once an asynchronous task has finished. This is quite similar to await in JavaScript, but applied to native functions and with no changes to your own code. To use it from Emscripten, you need to pass a special parameter, -s ASYNCIFY, and specify which imports should be treated as asynchronous (via -s ASYNCIFY_IMPORTS). In your code, you can then declare them as regular function imports and call them like any other function, while Asyncify does its magic under the hood. The great news is that, with the transition to the upstream LLVM backend, this feature wasn't lost: it was extracted into a separate transform that can now be used from any language, not just C/C++, as long as it compiles to WebAssembly. For example, you can simply invoke asynchronous JavaScript functions from Rust, which is particularly helpful for porting code that relies on the standard synchronous system APIs available on other platforms. Since you're not using Emscripten in this case, after compiling your module you need to run it through wasm-opt with the --asyncify flag instead, which adds all the necessary magic for suspending and resuming execution. Then you need some glue on the JavaScript side as well. We have a small library for that: it mimics the regular WebAssembly API, but allows you to instantiate modules with asynchronous imports and exports. To use it, first import it from the asyncify-wasm module. Then you can use the regular instantiation APIs, but with asynchronous imports in addition to the regular ones. Since your WebAssembly module might now invoke asynchronous APIs at arbitrary points, all of its exports need to become asynchronous, too, so you need to prefix calls to your exports with await. And you're good to go.
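A minimal C++ sketch of what Asyncify enables, assuming an Emscripten build linked with -s ASYNCIFY. emscripten_sleep is an Emscripten function that requires Asyncify: it reads like a blocking call, but under the hood it unwinds the WebAssembly stack, lets the browser's event loop run, and resumes at the same point when the timer fires. The native fallback and the names pause_ms and countdown are hypothetical, added only so the sketch also compiles outside Emscripten.

```cpp
#include <chrono>
#include <thread>
#ifdef __EMSCRIPTEN__
#include <emscripten.h>
#endif

// Pauses for `ms` milliseconds in code that reads as ordinary blocking C++.
// Emscripten + -s ASYNCIFY (assumed): suspends the Wasm stack and yields to
// the event loop, then resumes here. Natively: a plain thread sleep.
void pause_ms(unsigned int ms) {
#ifdef __EMSCRIPTEN__
  emscripten_sleep(ms);
#else
  std::this_thread::sleep_for(std::chrono::milliseconds(ms));
#endif
}

// Hypothetical polling loop: with Asyncify, each pause lets the page keep
// handling events instead of freezing the main thread.
int countdown(int steps, unsigned int ms_per_step) {
  int completed = 0;
  for (int i = 0; i < steps; ++i) {
    pause_ms(ms_per_step);
    ++completed;
  }
  return completed;
}
```

The calling code never mentions promises or callbacks; that is the point of the transform, which rewrites the compiled module so that the suspension and resumption happen around the unchanged C++ control flow.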
One particularly interesting use case for Asyncify, aside from external APIs, is in Emscripten itself. Emscripten allows you to mark rarely used parts of your code, split them into a separate WebAssembly module during compilation, and lazy-load them only when they're invoked. This lets you keep your initial bundle small, without breaking your own code and with minimal changes. To use it, you need to call a special function, emscripten_lazy_load_code. During compilation, any code following that call will be extracted into a separate WebAssembly module. Then, at runtime, when or if that code is actually reached, Emscripten uses Asyncify to dynamically load the missing pieces and continue as if the code had never been split in the first place. These are all great features, and it's amazing to see how WebAssembly is growing over time. However, as the feature set grows, the surface area for potential bugs expands as well. When things go wrong, and we all know they often do, you want to be able to find where the problem occurred, reproduce it step by step, track the inputs that led to the issue in the first place, and so on. In other words, you want to be able to debug the application. Until recently, you had two options for debugging WebAssembly. First, you could get stack traces and step over individual instructions in the WebAssembly text format. This helps somewhat with debugging small, isolated functions, but it's not very practical for larger apps, where the mapping between the disassembled code and your original sources is less obvious. Second, to work around this problem, Emscripten and DevTools initially adapted the existing source maps format, which was designed for languages that compile to JavaScript, for WebAssembly. This made it possible to map binary offsets in the compiled module to locations in the original source files. However, this format was designed for text languages.
Those languages have a clear mapping to JavaScript's concepts and values, whereas binary formats like WebAssembly can come from arbitrary source languages with arbitrary type systems. This makes the integration hacky, limited, and not widely supported outside of Emscripten. On the other hand, many native languages already have a common debugging format that contains all the information a debugger needs to resolve locations, variable names, type layouts, and much more. This format is called DWARF. While there are still some WebAssembly-specific features that need to be added for full compatibility, compilers like Clang and Rust already support emitting DWARF information into WebAssembly modules, which allows us to start using it directly in DevTools. As a first step, we went ahead and implemented native source mapping, so you can start debugging the WebAssembly modules produced by any of these compilers without having to resort to the disassembled format or custom scripts for source map generation. For now, this integration covers stepping into and over code in any of these languages, setting breakpoints, and resolving stack traces. There's still much more we can do, though, such as pretty-printing types or even evaluating expressions in the source languages. We are actively working on bringing this and many other improvements to the WebAssembly experience, so please stay tuned for future updates. And thank you for your time today. [APPLAUSE] [MUSIC PLAYING]
