Jens Nockert

Accelerating Javascript via SIMD

Posted at

Mozilla has a bug relating to the lack of SIMD instructions in Javascript. What they want to do is essentially add an assembly language to the web, to allow for a performance increase on computationally intensive code.

This is technology that is directly competing with WebCL and NaCl, and is in many ways a very good one, it would provide many of the advantages of NaCl or WebCL with a different set of disadvantages.

But there are a few problems, it could in theory allow you to build scripts that only run on certain browsers on certain CPUs. Of course, WebGL and typed arrays already provides Javascript with most of this disadvantage, since typed arrays already expose the native endianness of the hardware and WebGL may require specific graphics hardware.

There are a few ways to allow the Javascript programmer high performance primitives for building applications, some better than others.

Raw Assembly - Very High Performance

You could simply let the programmer write a piece of code in assembly as a string, or as a separate script, linked to your Javascript. In the same way that you do it in C with many common compilers, allowing programmers full flexibility when writing code and allowing the programmer access to the lowest level possible.

Advantages

Disadvantages

And while it would allow the programmer to interact with the CPU at the lowest level that also is a significant problem. Ensuring that native code is safe is hard, really hard, which would make any inline assembly proposal get shot down pretty quickly. Another thing that could shoot it down is that it is non-portable between the two major architectures, x86 and x86-64.

This would essentially be the same thing as NaCl.

Standard Intrinsics - High Performance

Let Javascript programmers use intrinsics identical or similar to those exposed by C compilers, these are also architecture specific, or specific to a family of CPUs (for example x86 and x86-64 sharing SSE intrinsics, or PowerPC and Power sharing Altivec/VMX), and generally map to a single or a few instructions.

They are often designed to allow the compiler to do register allocation and provide a friendlier API than the raw instructions. On Intel for example, they are three-operand instead of two-operand.

Advantages

Disadvantages

An API like this would allow programmers to take existing code, rip out the C and replace it with Javascript and the kernel written using intrinsics would run unmodified, allowing a quick speedup in many common algorithms without writing a lot of code.

And the programmer will never expect this code to be portable between processors or browsers, which allows us to remove support in the future, or change the implementation. But on the other hand, future browsers or uncommon processors might simply not work, or run unaccelerated Javascript instead, making the application unnecessarily slow.

On the other hand, with a good compiler and optimizer, it would allow Javascript developers to write application kernels that execute as fast as the hardware allows, which could make the approach quite interesting.

Javascript Specific Intrinsics - High Performance

Javascript specific intrinsics, these would still be CPU specific, and expose all or most functionality of the CPU, but with intrinsics optimized for security and use with Javascript. I guess they would only operate on memory (Typed Arrays), but they could also introduce special vector types.

Advantages

Disadvantages

Such an API would have share most of the advantages and disadvantages with the standard intrinsics, but would trade the ability to use existing code for a nicer API or better performance depending on implementation.

Generic Intrinsics - Low Performance

Providing processor-independent intrinsics that can optionally be accelerated with a SIMD unit, would be the last type of implementation based on intrinsic functions. And it is a quite interesting sort of implementation.

Such an API could in theory be supported on all processors, and accelerate code even on processors without SIMD support (like the nVidia Tegra 2) or on architectures that people won't write platform specific code for (like MIPS).

Advantages

Disadvantages

From a pure performance standpoint it probably is the worst design, it is also the design that would face the least resistance, since it is in theory portable. And would be useful even if it only covers a subset of the common instructions.

Floating-point operations would probably be the most useful subset, since they are probably the most common data type in most WebGL applications, games and simulations. They are also supported pretty evenly in all SIMD implementations we care about, which gives us a pretty good starting point on what would be important.

Even without trying to generate SIMD code, such an API could probably allow for pretty good acceleration, making it low-hanging fruit for browser developers, and for web developers. And combined with a good optimizing JIT it could probably bring performance that is quite close to writing raw assembly language.

Generic Vector/Matrix API - Very Low Flexibility

A Javascript vector/matrix API could expose most of the required floating point functionality that we would get with SIMD, except that it would be much slower for small vectors, making it a lot less useful for WebGL/Games etc.

Just exposing something like BLAS has some advantages, since programmers are used to it, and it has very high-speed implementations on every platform with any kind of support for floating-point. Also, if the system has coprocessors with significant floating point capability (like a GPU), the chance that they implement a fast BLAS is pretty high, which could be important for performance on future embedded platforms like phones or tablets.

Advantages

Disadvantages

While BLAS would provide a high-performance API, I have a hard time seeing that it would be useful in the domains that Javascript is used in, web applications.

There are probably few Javascript applications that solve large systems of linear equations or do many large matrix-matrix multiplications, so while I really like BLAS, I think it will only add complexity to the browsers for no significant gain right now.

It also seems to just be a less flexible version of WebCL, which isn't really what a SIMD API is supposed to compete with.

Conclusion

So there is definitely a point in doing a general API for graphics and floating point operations, it would also be useful for audio processing and so on, but for integer and media instructions, where there is a significant spread in implementations, I cannot really see how a generic API is going to work.

LLVM provides general instructions and types for SIMD, with target specific instructions in addition to that, so some sort of mix is definitely possible where there are a basic subset that is enabled on all CPUs and a larger set of media instructions and instructions that are hard to emulate quickly that are only available on specific targets.

For Aurora.js (my multimedia framework), I don't think most of the floating point operations will be of much use, and most of a generic SIMD API would probably be concerned with floats, at least to begin with, but I am still interested on working on a possible proposal that would provide a few basic primitives that could really help some of the audio processing, and probably help graphics related tasks a lot.

But the most important thing is probably that it could accelerate graphics intensive applications quite a bit, for example when doing matrix and vector math on fixed-length vectors, something that I assume is pretty common in most WebGL applications and most games, but could probably be useful even on `normal' web applications.