Problems with a ‘pure Javascript’ implementation of H.264
I have written a lot of audio decoders in Javascript, and helped write a few more. I have never tackled video, for a few reasons. Here I’ll try to sum up why a video decoder will probably never be implemented in ‘pure’ Javascript, and the methods with which I think it will be implemented instead.
Even the most high-end audio codecs are designed to work on really low-end DSP devices. ALAC (Apple Lossless Audio Codec), for example, decodes stereo fine in software on one of the 90 MHz ARM7TDMI cores in the original iPod. AAC requires a bit more, but it is still within the reach of software on a relatively slow processor, like a Pentium or G3. A modern ARM processor can decode MP3 at a clock speed of a mere 10 MHz and, with a bit more, AAC, which is essentially the most demanding audio codec that you’ll meet on the web.
Video codecs, on the other hand, are an entirely different story. The 2.4 GHz Core 2 Duo in my laptop (a MacBook Pro) has serious problems decoding high-end H.264 (1080p Hi10P, for example) with FFmpeg. My desktop, a reasonably modern Xeon quad-core, handles these videos fine using FFmpeg, but with significant load. Note that this is with an implementation that is hand-optimized with assembly. We cannot depend on hardware support to improve the situation either, because it is often out of date. No graphics card in my collection supports this profile in hardware yet, for example.
On top of these problems, there are some serious limitations in Javascript/ECMAScript that make it a bad platform for video decoding. Broadway.js is a very cool demo of emscripten, but these limitations are the reason I don’t think it will ever be able to decode H.264 in any sort of sane capacity using merely emscripten and some minor hand optimizations, without a radically different Javascript engine to support it.
Floating-point
Essentially all operations in Javascript operate on floating-point numbers, and this is not likely to change in the future. For audio codecs, this is not really a problem since they tend to be designed in a way that you can implement them as both fixed-point and floating-point.
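To make this concrete, here is a small sketch of how integer behaviour has to be coaxed out of Javascript's double-precision numbers. The helper names are made up for this illustration. The bitwise coercion, Math.imul and typed arrays are standard language features, although older engines may lack the latter two.

```javascript
// Every Javascript number is an IEEE 754 double, so plain arithmetic is floating-point.
// The helper names below are illustrative only.

// Integer division: divide as doubles, then truncate to a 32-bit integer with |0.
function divInt32(a, b) {
  return (a / b) | 0;
}

// 32-bit multiply with C-style wraparound; a plain * can lose low bits on large products.
function mulInt32(a, b) {
  return Math.imul(a, b);
}

// 16-bit wraparound by storing the sum into a typed array cell.
function addInt16(a, b) {
  const cell = new Int16Array(1);
  cell[0] = a + b;
  return cell[0];
}
```

Each of these is an extra conversion step layered on top of floating-point arithmetic, which is exactly the overhead a native fixed-point implementation never pays.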
Video codecs, on the other hand, tend to rely heavily on fixed-point for optimization; H.264 is even designed to avoid needing floating-point as much as possible. Even the discrete cosine transform and motion compensation in H.264 are modified to operate on fixed-point numbers instead of floating-point.
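As a rough illustration of what an integer-only transform looks like, the following is a minimal sketch of a 4x4 inverse transform in the style used by H.264, built from additions, subtractions and shifts. The function names and the flat 16-element row-major layout are assumptions of this example, and the input is assumed to be already dequantized. A real decoder also handles prediction, clipping and the other block sizes.

```javascript
// One-dimensional inverse butterfly over 4 values spaced `stride` apart.
// Only integer adds, subtracts and shifts are used.
function inverse1d(d, offset, stride) {
  const d0 = d[offset];
  const d1 = d[offset + stride];
  const d2 = d[offset + 2 * stride];
  const d3 = d[offset + 3 * stride];
  const e0 = d0 + d2;
  const e1 = d0 - d2;
  const e2 = (d1 >> 1) - d3;
  const e3 = d1 + (d3 >> 1);
  d[offset] = e0 + e3;
  d[offset + stride] = e1 + e2;
  d[offset + 2 * stride] = e1 - e2;
  d[offset + 3 * stride] = e0 - e3;
}

// Inverse-transform a 4x4 block of scaled coefficients into residual samples.
function inverseTransform4x4(coeffs) {
  const block = Int32Array.from(coeffs);
  for (let row = 0; row < 4; row++) inverse1d(block, row * 4, 1); // rows
  for (let col = 0; col < 4; col++) inverse1d(block, col, 4);     // columns
  for (let i = 0; i < 16; i++) block[i] = (block[i] + 32) >> 6;   // round and scale
  return block;
}
```

Storing the block in an Int32Array keeps the values integral, but the engine still has to prove that to itself before it can emit integer instructions.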
The reason for this is that modern processors can often process fixed-point operations much faster, especially the 8- and 16-bit operations that are the most common. These short integer instructions often have at least 4 times the throughput of double-precision floating-point. Certain complex instructions like division make the difference irrelevant, and in many cases require a fallback to floating-point, but these operations are extremely uncommon in H.264.
SIMD
This is before the SIMD penalty is added for Javascript. Because current Javascript engines use only scalar operations, a significant part of the execution hardware (1/2 to 1/4) spends most of the time idling.
Most decoders use these SIMD instructions, which give them access to 8-16 times more throughput per core for simple operations. In addition, there are special instructions for optimizing MPEG codecs that give a quite measurable further speedup, which you are unlikely to be able to use without hand-optimized code.
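To give a feel for what "several lanes per operation" means, here is a sketch that packs four unsigned 8-bit values into one 32-bit word and adds them all at once. This is "SIMD within a register" emulation in scalar Javascript, not a real SIMD instruction, and the function names are invented for the example. A SIMD unit does the same kind of thing in hardware on 8 or 16 lanes, without the masking tricks.

```javascript
// Add four unsigned 8-bit lanes packed into one 32-bit word, wrapping per lane.
function addPackedU8(a, b) {
  // Add the low 7 bits of every lane; a carry cannot cross a lane boundary.
  const low = (a & 0x7f7f7f7f) + (b & 0x7f7f7f7f);
  // Fold in each lane's top bit with XOR, which drops the lane's carry-out.
  const high = (a ^ b) & 0x80808080;
  return (low ^ high) >>> 0;
}

// The same result computed one lane at a time, as purely scalar code would.
function addPackedU8Scalar(a, b) {
  let out = 0;
  for (let shift = 0; shift < 32; shift += 8) {
    const lane = (((a >>> shift) & 0xff) + ((b >>> shift) & 0xff)) & 0xff;
    out |= lane << shift;
  }
  return out >>> 0;
}
```

The packed version does the work of the four-iteration loop in a handful of operations, which is the kind of gap the scalar-only engines leave on the table.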
Threads
To provide the final blow against current Javascript, there is very little possibility for shared-memory multicore programming in a browser. Workers are not good enough for this. I haven’t actually measured it, as I do not plan to implement a video decoder with workers, but I think that the cost of communication and the latency are currently too high for it to make sense.
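The sketch below illustrates the communication model rather than its speed: data sent to a worker is either copied or moved, never shared. It assumes a runtime that provides the global structuredClone function, which is used here as a stand-in for postMessage so that the example runs without creating a Worker. The function names and the choice of an 8-bit YUV 4:2:0 frame are assumptions of this example.

```javascript
// One 1080p frame in 8-bit YUV 4:2:0: a full-size luma plane plus two
// quarter-size chroma planes.
const FRAME_BYTES = 1920 * 1080 * 3 / 2; // 3110400 bytes

// Equivalent of worker.postMessage(frame): the receiver gets its own copy.
function sendByCopy(frame) {
  return structuredClone(frame);
}

// Equivalent of worker.postMessage(frame, [frame.buffer]): the buffer is moved,
// and the sender can no longer read it.
function sendByTransfer(frame) {
  const moved = structuredClone(frame.buffer, { transfer: [frame.buffer] });
  return new Uint8Array(moved);
}
```

Either way, two threads never see the same frame at the same time, so a decoder split across workers has to pass whole frames or slices back and forth.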
Only using a single core on a processor that has 2-8 is another problem that would keep a Javascript implementation from ever competing with a native implementation.
Solution
There are two obvious solutions to all of these problems that are being prototyped on the web right now: WebCL and Rivertrail. Both of these are designed mainly to solve the threading problem, which is likely not the biggest issue, but it is still significant.
Rivertrail could solve most of these problems, since it is currently based on Intel’s OpenCL runtime, which has good optimizers. It isn’t designed for this very specialized task, and while it does allow you to reduce precision, it doesn’t allow direct access to integer or media instructions. It is still a much better option than pure Javascript, and with the addition of an integer API to Javascript it could easily turn into the preferred method.
WebCL (OpenCL for the web), on the other hand, already solves most or all of these problems, since it is essentially a massively parallel C with SIMD and device-specific extensions. It even allows the GPU to pick up most of the burden, which is in many cases preferable to running on the CPU due to the extra computational power available.
There are probably other solutions as well, but just hoping that single-threaded Javascript with double-precision floating-point will ever be enough is naïve and counter-productive in my opinion. Mobile devices have special concerns of their own, and WebCL is in a great position to solve these too in the future.
And while I would love to be proved wrong about this, I don’t think I will be for a long while, and at that point, there will be more advanced codecs and higher resolutions around to target.
Accelerating Javascript via SIMD
Mozilla has a bug relating to the lack of SIMD instructions in Javascript. What they want to do is essentially add an assembly language to the web, to allow for a performance increase on computationally intensive code.
This is technology that competes directly with WebCL and NaCl, and it is in many ways a very good one: it would provide many of the advantages of NaCl or WebCL with a different set of disadvantages.
But there are a few problems. It could in theory allow you to build scripts that only run on certain browsers on certain CPUs. Of course, WebGL and typed arrays already give Javascript most of this disadvantage, since typed arrays expose the native endianness of the hardware and WebGL may require specific graphics hardware.
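The endianness leak is easy to demonstrate. The sketch below, with function names chosen for this example, reads the same four bytes through two typed-array views to detect the byte order of the machine, and then uses DataView with an explicit byte order to get a result that does not depend on the CPU.

```javascript
// Returns true when the hardware stores the least significant byte first.
function isLittleEndian() {
  const word = new Uint32Array([0x11223344]);
  const bytes = new Uint8Array(word.buffer);
  return bytes[0] === 0x44;
}

// DataView takes an explicit byte order, so this reads the same value on every CPU.
function readUint32BE(buffer, offset) {
  return new DataView(buffer).getUint32(offset, false);
}
```

Code that reinterprets a buffer through Uint32Array directly, without going through DataView, can therefore behave differently on different processors.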
There are a few ways to allow the Javascript programmer high performance primitives for building applications, some better than others.
Raw Assembly - Very High Performance
You could simply let the programmer write a piece of code in assembly as a string, or as a separate script linked to your Javascript, in the same way that you do it in C with many common compilers. This gives programmers full flexibility when writing code and access to the lowest level possible.
Advantages
- Very powerful (all CPU-specific features exposed)
- Very fast
- Much code already available
- No one expects it to be portable
Disadvantages
- How do we know the code is safe?
- A language within a language, with very different semantics
- Must keep track of register usage and restrict memory accesses
And while it would allow the programmer to interact with the CPU at the lowest level, that is also a significant problem. Ensuring that native code is safe is hard, really hard, which would get any inline assembly proposal shot down pretty quickly. Another thing that could shoot it down is that it is non-portable between the two major architectures, x86 and x86-64.
This would essentially be the same thing as NaCl.
Standard Intrinsics - High Performance
Let Javascript programmers use intrinsics identical or similar to those exposed by C compilers. These are also architecture-specific, or specific to a family of CPUs (for example x86 and x86-64 sharing SSE intrinsics, or PowerPC and Power sharing Altivec/VMX), and generally map to a single instruction or a few instructions.
They are often designed to allow the compiler to do register allocation and provide a friendlier API than the raw instructions. On Intel for example, they are three-operand instead of two-operand.
Advantages
- Very powerful (most CPU-specific features exposed)
- Very fast (with a good compiler and optimizer)
- Much code already available
- No one expects it to be portable
Disadvantages
- May need to add data types (__m64… for Intel, float32x4_t… for ARM NEON)
- Possibly hard to restrict memory accesses
- Modifies the Javascript runtime
An API like this would allow programmers to take existing code, rip out the C and replace it with Javascript, while the kernel written using intrinsics would run unmodified. That allows a quick speedup in many common algorithms without writing a lot of code.
And the programmer will never expect this code to be portable between processors or browsers, which allows us to remove support in the future, or change the implementation. But on the other hand, future browsers or uncommon processors might simply not work, or run unaccelerated Javascript instead, making the application unnecessarily slow.
Still, with a good compiler and optimizer, it would allow Javascript developers to write application kernels that execute as fast as the hardware allows, which could make the approach quite interesting.
Javascript Specific Intrinsics - High Performance
Javascript-specific intrinsics would still be CPU-specific, and expose all or most functionality of the CPU, but with intrinsics optimized for security and use with Javascript. I guess they would only operate on memory (Typed Arrays), but they could also introduce special vector types.
Advantages
- Powerful (all features we need could be exposed)
- Very fast (with a good compiler and optimizer)
- Nicer syntax
- No one expects it to be portable
- Could be designed to allow for Javascript fallbacks
Disadvantages
- Hard to implement 64-bit integer support
- Possibly hard to restrict memory accesses
- Modifies the Javascript runtime
Such an API would share most of the advantages and disadvantages of the standard intrinsics, but would trade the ability to use existing code for a nicer API or better performance, depending on the implementation.
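A minimal sketch of how such an intrinsic could operate on Typed Arrays and still allow a Javascript fallback is shown below. Everything about the accelerated path is hypothetical: no engine provides a global called Intrinsics, and the addSaturateU8 name and signature are invented for this illustration. The operation itself, an unsigned 8-bit add that clamps at 255, is typical of what media instruction sets provide.

```javascript
// HYPOTHETICAL: `Intrinsics` and `addSaturateU8` do not exist in any engine;
// they stand in for whatever a real proposal would expose.
const nativeAddSaturateU8 =
  typeof globalThis.Intrinsics === "object" && globalThis.Intrinsics !== null
    ? globalThis.Intrinsics.addSaturateU8
    : undefined;

// Plain Javascript fallback: per-byte add that clamps at 255 instead of wrapping.
function addSaturateU8Fallback(dst, a, b) {
  for (let i = 0; i < dst.length; i++) {
    dst[i] = Math.min(255, a[i] + b[i]);
  }
  return dst;
}

// Use the accelerated version when present, otherwise the fallback.
const addSaturateU8 =
  typeof nativeAddSaturateU8 === "function" ? nativeAddSaturateU8 : addSaturateU8Fallback;
```

Because the inputs and the output are Typed Arrays of known length, a runtime could bounds-check the whole call once instead of checking every access.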
Generic Intrinsics - Low Performance
The last type of implementation based on intrinsic functions would provide processor-independent intrinsics that can optionally be accelerated with a SIMD unit. It is a quite interesting sort of implementation.
Such an API could in theory be supported on all processors, and accelerate code even on processors without SIMD support (like the nVidia Tegra 2) or on architectures that people won’t write platform specific code for (like MIPS).
Advantages
- Portable
- Nice syntax
- Could still be much faster than ‘pure’ Javascript
- Would be a nice way to add 64-bit integers
Disadvantages
- Modifies the Javascript runtime
- Needs fallbacks on unsupported architectures
- Does not expose all capabilities
- Exposes non-accelerated capabilities
- Possibly very complicated
From a pure performance standpoint it is probably the worst design, but it is also the design that would face the least resistance, since it is in theory portable. It would be useful even if it only covered a subset of the common instructions.
Floating-point operations would probably be the most useful subset, since floats are probably the most common data type in most WebGL applications, games and simulations. They are also supported pretty evenly in all SIMD implementations we care about, which gives us a pretty good starting point on what would be important.
Even without trying to generate SIMD code, such an API could probably allow for pretty good acceleration, making it low-hanging fruit for browser developers, and for web developers. And combined with a good optimizing JIT it could probably bring performance that is quite close to writing raw assembly language.
Generic Vector/Matrix API - Very Low Flexibility
A Javascript vector/matrix API could expose most of the required floating-point functionality that we would get with SIMD, except that it would be much slower for small vectors, making it a lot less useful for WebGL, games, etc.
Just exposing something like BLAS has some advantages, since programmers are used to it, and it has very high-speed implementations on every platform with any kind of support for floating-point. Also, if the system has coprocessors with significant floating-point capability (like a GPU), the chance that they implement a fast BLAS is pretty high, which could be important for performance on future embedded platforms like phones or tablets.
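For readers who have not used BLAS, the sketch below shows the shape of one of its simplest routines, saxpy, which computes y := alpha * x + y. The argument order follows the classic BLAS convention, but this is a plain Javascript illustration that assumes non-negative strides, not a binding to any real BLAS library.

```javascript
// y := alpha * x + y over n elements, reading x every incx elements and
// updating y every incy elements (both strides assumed non-negative here).
function saxpy(n, alpha, x, incx, y, incy) {
  for (let i = 0, ix = 0, iy = 0; i < n; i++, ix += incx, iy += incy) {
    y[iy] = alpha * x[ix] + y[iy];
  }
  return y;
}
```

The whole vector is handed over in a single call, which is what lets an implementation pick the fastest path available, and also why the per-call overhead hurts for tiny vectors.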
Advantages
- Good for scientific computing
- Easy to use
- Very optimized
- Safe
- Portable
Disadvantages
- No media instruction support
- Slow for small vectors/matrices
While BLAS would provide a high-performance API, I have a hard time seeing that it would be useful in the domain that Javascript is used in, web applications. There are probably few Javascript applications that solve large systems of linear equations or do many large matrix-matrix multiplications, so while I really like BLAS, I think it will only add complexity to the browsers for no significant gain right now.
It also seems to just be a less flexible version of WebCL, which isn’t really what a SIMD API is supposed to compete with.
Conclusion
So there is definitely a point in doing a general API for graphics and floating-point operations, and it would also be useful for audio processing and so on. But for integer and media instructions, where there is a significant spread in implementations, I cannot really see how a generic API is going to work.
LLVM provides general instructions and types for SIMD, with target-specific instructions in addition to that. Some sort of mix is therefore definitely possible: a basic subset that is enabled on all CPUs, plus a larger set of media instructions and instructions that are hard to emulate quickly, available only on specific targets.
For Aurora.js (my multimedia framework), I don’t think most of the floating-point operations will be of much use, and most of a generic SIMD API would probably be concerned with floats, at least to begin with. I am still interested in working on a possible proposal that would provide a few basic primitives that could really help some of the audio processing, and probably help graphics-related tasks a lot.
But the most important thing is probably that it could accelerate graphics-intensive applications quite a bit, for example when doing matrix and vector math on fixed-length vectors. I assume that is pretty common in most WebGL applications and most games, but it could probably be useful even in ‘normal’ web applications.
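As an example of the fixed-length vector math in question, here is a sketch of a 4x4 matrix times 4-element vector product in plain Javascript. The function name is made up for this illustration, and the column-major layout is an assumption chosen to match how WebGL applications usually store matrices. Each of the four output lines is a 4-wide multiply-and-add, which is the pattern a 4-lane float SIMD unit handles directly.

```javascript
// Multiply a column-major 4x4 matrix (16 floats) by a 4-element vector.
function transformVec4(m, v, out = new Float32Array(4)) {
  const x = v[0], y = v[1], z = v[2], w = v[3];
  out[0] = m[0] * x + m[4] * y + m[8] * z + m[12] * w;
  out[1] = m[1] * x + m[5] * y + m[9] * z + m[13] * w;
  out[2] = m[2] * x + m[6] * y + m[10] * z + m[14] * w;
  out[3] = m[3] * x + m[7] * y + m[11] * z + m[15] * w;
  return out;
}
```

A typical use is moving a point by a translation matrix, where the last column of the matrix holds the offset.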