Problems with a ‘pure Javascript’ implementation of H.264

I have written a lot of audio decoders in Javascript, and helped write a few more. I have never tackled video for a few reasons, and I’ll try to sum up why there will probably never be one implemented in ‘pure’ Javascript, and the methods with which I think it will be implemented instead.

Even the most high-end audio codecs are designed to work on really low-end DSP devices. ALAC (Apple Lossless Audio Codec), for example, decodes stereo fine in software on one of the 90 MHz ARM7TDMI cores in the original iPod. AAC requires a bit more, but it is still within the reach of software on a relatively slow processor, like a Pentium or G3. A modern ARM processor can decode MP3 at a clock speed of a mere 10 MHz, and with a bit more, AAC, which is essentially the most demanding audio codec that you’ll meet on the web.

Video codecs, on the other hand, are an entirely different story. The 2.4 GHz Core 2 Duo in my laptop (a MacBook Pro) has serious problems decoding high-end H.264 (1080p Hi10P, for example) with FFmpeg. My desktop, a reasonably modern quad-core Xeon, handles these videos fine using FFmpeg, but with significant load. Note that this is with an implementation that is hand-optimized with assembly. We cannot depend on hardware support to improve the situation either, because it is often out of date. No graphics card in my collection supports this profile in hardware yet, for example.

On top of these problems, there are some serious limitations in Javascript/ECMAScript that make it a bad platform for video decoding. Broadway.js is a very cool demo of emscripten, but these limitations are the reasons why I don’t think it will ever be able to decode H.264 in any sort of sane capacity using merely emscripten and some minor hand optimizations, without a radically different Javascript engine to support it.

Floating-point

Essentially all operations in Javascript operate on floating-point numbers, and this is not likely to change in the future. For audio codecs, this is not really a problem since they tend to be designed in a way that you can implement them as both fixed-point and floating-point.

Video codecs, on the other hand, tend to rely a lot on fixed-point for optimization, and H.264 is even designed to avoid needing floating-point as much as possible. Even the discrete cosine transform and motion compensation in H.264 are modified to operate on fixed-point numbers instead of floating-point.

The reason for this is that modern processors can often process fixed-point operations much faster, especially the 8 and 16 bit operations that are the most common. These short integer instructions often have at least 4 times the throughput of double precision floating-point. Certain complex instructions like division make the difference irrelevant, and in many cases require a fallback to floating-point, but these operations are extremely uncommon in H.264.
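As a minimal sketch of what such fixed-point code looks like when written in Javascript, the function below computes one 4-point integer butterfly of the kind H.264 uses in place of a floating-point DCT. The name `transform4` and the use of typed arrays are assumptions made for illustration; this is not taken from any real decoder. Note that every value is still nominally a double, and the `| 0` coercions only give the engine a chance to prove that integer arithmetic is safe.

```javascript
// One 4-point integer butterfly built from adds, subtracts and shifts only.
// `transform4` is a hypothetical name used for illustration.
function transform4(x) {
  // `| 0` hints to the engine that each intermediate is a 32-bit integer.
  const s0 = (x[0] + x[3]) | 0;
  const s1 = (x[1] + x[2]) | 0;
  const d0 = (x[0] - x[3]) | 0;
  const d1 = (x[1] - x[2]) | 0;

  const y = new Int32Array(4);
  y[0] = s0 + s1;
  y[1] = (d0 << 1) + d1;
  y[2] = s0 - s1;
  y[3] = d0 - (d1 << 1);
  return y;
}

const inputRow = Int16Array.of(1, 2, 3, 4);
console.log(Array.from(transform4(inputRow))); // [10, -7, 0, -1]
```

A native decoder would run many of these butterflies at once in 16-bit SIMD lanes, which is exactly what the scalar Javascript version cannot express.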

SIMD

And this is before the SIMD penalty for Javascript is added. Because current Javascript engines utilize only scalar operations, a significant part of the execution hardware (1/2 to 1/4) spends most of the time idling.

Most decoders utilize SIMD instructions, which give them access to 8-16 times more throughput per core for simple operations. There are also special instructions for optimizing MPEG codecs that give a quite measurable additional speedup, which you are unlikely to be able to utilize without hand-optimized code.

Threads

To provide the final blow against current Javascript, there is very little possibility for shared-memory multicore programming in a browser. Workers are not good enough to do this. I haven’t actually measured it, as I do not plan to implement a video decoder with workers, but I think that the cost of communication and latency is currently too high for it to make sense.

Only using a single core on a processor that has 2-8 is another problem that would keep a Javascript implementation from ever competing with a native implementation.

Solution

There are two obvious solutions to all of these problems that are being prototyped on the web right now: WebCL and River Trail. Both of these are mainly designed to solve the threading problem, which is likely not the biggest issue, but it is still significant.

River Trail could solve most of these problems since it is currently based on Intel’s OpenCL runtime, which has good optimizers. It isn’t designed for this very specialized task, and while it does allow you to reduce precision, it doesn’t allow direct access to integer or media instructions. It is still a much better option than pure Javascript, and with the addition of an integer API to Javascript, it could easily turn into the preferred method.

WebCL (OpenCL for the web) on the other hand, already solves most or all of these problems since it is essentially a massively parallel C with SIMD and device-specific extensions. It even allows for the GPU to pick up most of the burden, which is in many cases preferable to running on the CPU due to the extra computational power available.

There are probably other solutions as well, but just hoping that single-threaded Javascript with double precision floating-point will ever be enough is naïve and counter-productive in my opinion. Mobile devices have special concerns of their own, and WebCL is in a great position to solve those too in the future.

And while I would love to be proved wrong about this, I don’t think I will be for a long while, and at that point, there will be more advanced codecs and higher resolutions around to target.

Testing numerical accuracy of browsers

According to the standard, only the arithmetic operations in Javascript need to be correctly rounded; the functions in Math do not have any accuracy requirements.

But out in the real world, browsers are a bit better than that. We have a feeling that the functions in Math are reasonably accurate, but if you need to be convinced (like me), then you should look at https://github.com/JensNockert/accuracy.js, which fuzz tests most of the operations in Math that have a tendency to be inaccurate.

If you want to be even more convinced, generate more test cases using generate.rb.
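To give a feeling for what such a fuzz test can look like, here is a minimal sketch for sqrt only. It is not taken from accuracy.js; the function names, the generator constants and the choice of test property are assumptions made for illustration. The property used is that n * n is exactly representable for any integer n below 2^26, so a correctly rounded sqrt has to return exactly n.

```javascript
// Small deterministic pseudo-random generator (a linear congruential
// generator) so that any failure is reproducible from the seed.
function makeRandom(seed) {
  let state = seed >>> 0;
  return function next26() {
    state = (state * 1664525 + 1013904223) >>> 0;
    return state >>> 6; // keep the top 26 bits
  };
}

// Counts how often Math.sqrt(n * n) differs from n for random n < 2^26.
function countSqrtFailures(iterations, seed) {
  const next26 = makeRandom(seed);
  let failures = 0;
  for (let i = 0; i < iterations; i++) {
    const n = next26();
    if (Math.sqrt(n * n) !== n) failures++;
  }
  return failures;
}

console.log(countSqrtFailures(100000, 1));
```

Any non-zero count means that sqrt is not correctly rounded on that browser and platform.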

P.S. sin, cos and tan are missing; their periodicity makes them hard to fuzz using this technique.

Update

  1. I fuzzed on Windows as well, and Chrome on Windows does not provide sqrt with correct rounding; a bug has been filed. Firefox and Opera provide as much precision on Windows as on OS X.

Environment and Feature Detection

A warning: most of these extensions may currently have extreme security issues; they are prototypes after all. Use a separate browser instance and profile for this series.

If you already have a working install of WebCL and River Trail, you can skip this part. Do not assume that you have them just because you have the latest version of your browser, because they are not available yet in any browser without a special build or extension.

Preparation

The first step is to install OpenCL drivers for all your devices.

If you are running OS X (Snow Leopard or Lion) then all OpenCL drivers are already installed and you’re good to go.

If you are running Linux or Windows, then you might need to install some OpenCL drivers. If your CPU is supported by both the AMD and Intel SDKs, then I recommend installing both.

If you have an nVidia or AMD graphics card, you probably already have OpenCL drivers installed for them (they are included in the graphics drivers), but you should make sure they are the latest version.

Now we’re onto a few semi-optional tools that you probably want, but can do without if you prefer,

Git is my version control system of choice, and you will probably want to check out the repositories of examples and on Github.

Coffeescript is a thin wrapper around Javascript that I happen to like; it has a somewhat more Pythonesque syntax and is really nice. A lot of the support libraries and a few of the examples will be written in Coffeescript, and you might want to be able to recompile them.

Installing WebCL

Now we can start installing the WebCL prototypes.

If you are running Linux or Windows, then you need to first install a 32-bit version of Firefox 10 and make sure that you have installed Firebug into your new profile, then you can install the Nokia WebCL prototype.

If you are running OS X, then you need to install the Samsung WebCL prototype based on WebKit. It is a bit complicated since you need to compile it from scratch.

Just follow the included readme. A while into the build you might meet some compilation errors, but they are easily fixable.

Installing River Trail

To allow River Trail code to be accelerated via OpenCL you need to install the River Trail extension.

On Windows or Linux it really does need the Intel OpenCL driver; the AMD OpenCL driver or the driver for your graphics card is not enough. But if your computer does not support the Intel OpenCL driver, then you can still execute River Trail code using a normal Javascript engine.

On OS X, the built in OpenCL drivers are fine.

Detecting WebCL and River Trail

So, after installing all that, we need a simple way to check that it is working. The easiest way is to check out https://github.com/JensNockert/tools-for-the-next-generation with git (or download an archive of the repository from Github).

Under “01 - Feature Detection” there are two html files, webcl.html and rivertrail.html that contain feature detection code. Try both to make sure that your setup works. If everything installed correctly, it will look something like this for River Trail,

And something like this for WebCL,

The code is not really that spectacular, but feel free to check out the source and see my horrible DOM manipulation code. (Hook me up with a pull request if you enjoy that kind of stuff)

About OpenCL

Make sure you have webcl.html open in a browser, and make a small note of the structure of the information.

The first level in the output (“Apple” in my screenshot) is the OpenCL platform name, and underneath it are all the OpenCL devices belonging to that platform. Note that a single piece of hardware can show up as a device under multiple platforms.

In OpenCL there are two domains where code can execute: on the host (in WebCL this is the browser) or on a device which is connected to a host. The code we run on the host we call the ‘Application’ and the code we run on devices we call ‘Kernels’.

And as we will learn in future lessons, calling kernels is different from how we call normal functions from the host. Another important thing to note about kernels is that they are not written in Javascript but a high-performance variant of C.

Summary

To summarize what you should install,

  • Browser capable of WebCL
  • Browser capable of accelerated River Trail

and make sure they work. The rest is mainly sugar that could help you reach that goal.

Notes

I will be using Firefox most of the time, but the example code that does not depend on a specific feature should be portable to most major platforms (Firefox, Chrome, Safari and Opera.)

Any specific feature dependencies will be noted in the corresponding article (and please point it out in the comments if it is not.)

Edits

  1. Updated for Firefox 10

Tools for the next generation of Web Applications: Introduction

I do not know how the web will evolve in the future. I don’t think that anybody knows what the web of 2020 will look like, or what applications will be popular then.

But regardless of the direction the web evolves, we will undoubtedly see more and more complex client-side applications being developed. And a lot of the applications that were traditionally native applications will probably migrate to the browser within this time frame.

The migrating applications might include everything from games to large simulations to image and video editing, and everything in between. Your imagination is hopefully the only limit to what you will be able to achieve.

Betting on the web is one of the safest bets to take: it is simply the platform that is most accessible to people today, and the platform people care about the most.

The goal of this series of articles is to give you some insight into some techniques, frameworks and tools that might be useful to build this new generation of applications, or to allow you to improve your current applications.

The tools that I am most interested in are tools that enable a new class of applications that we earlier could not build in the browser without plugins, and we’ll primarily focus on the subset of these that is almost purely performance increasing.

For example,

  • Faster Javascript engines
  • Typed Arrays
  • SIMD Intrinsics
  • Workers
  • River Trail
  • WebCL
  • WebGL (for computing, not graphics)

But to understand these new tools of the web, we need to understand the native libraries and features that power them.

The faster Javascript engines of the future, typed arrays and any SIMD intrinsics are designed to accelerate each thread of your applications. They do so by allowing us to utilize each processing core in a better way, and program ‘closer to the metal’.

River Trail and WebCL utilize OpenCL to allow a piece of code to run on multiple processor cores, and WebCL also allows your code to run on graphics cards and even specialist OpenCL accelerators right from your browser.

If you haven’t heard of OpenCL, it is a framework for heterogeneous computing designed by Khronos (who also maintain the OpenGL standard), and it allows you to execute kernels written in a high-performance variant of C on just about any processor around. OpenCL supports everything from large clusters down to small embedded systems.

Single-threaded

Currently there are two engines that I enjoy coding for: the new Spidermonkey with type inference introduced in Firefox 9, which is really nice, and the V8 / Crankshaft engine used in Chrome. But the Javascript engines of the future have a lot more in store for us, and all of them are already picking up steam.

For example, Mozilla is currently working on Ionmonkey. Ionmonkey is a new whole-method JIT for Spidermonkey that hopefully brings some significant speedup for many types of code (and especially the type of code that we are interested in). It isn’t ready yet, but we can already see some benchmarks here and follow how it develops.

Internet Explorer 9 introduced the new Chakra engine, which has some interesting features that will probably migrate to other engines. For example, it compiles code on a separate thread, allowing it to load code faster and start executing it quicker. And I am convinced that Internet Explorer 10 will introduce features that will allow Internet Explorer to defend its position as the most widely used browser.

One of these features that will be included in the next version of Internet Explorer (but is already supported in all other major browsers) is one of the most significant API developments in high-performance Javascript during the last few years: typed arrays. Typed arrays behave in most respects like regular Javascript arrays, but they have a fixed type and length. This on one hand gives Javascript programmers a nice way to interact with binary data and on the other hand gives the Javascript engines a lot more opportunities for optimization.
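As a small illustration of the fixed type and length, here is a sketch using two of the standard typed array constructors. The variable names are arbitrary and chosen only for this example.

```javascript
// A typed array has a fixed element type and a fixed length.
const samples = new Int16Array(4);
samples[0] = 1.7;    // stored as 1: the fraction is truncated
samples[1] = 40000;  // stored as -25536: wraps around to fit 16 bits

// Several views can share one underlying buffer of raw bytes.
const sampleBytes = new Uint8Array(samples.buffer);

console.log(samples.length);     // 4 elements
console.log(sampleBytes.length); // 8 bytes
```

Because the engine knows that every element is a 16-bit integer, it can store the data compactly and skip the type checks that a regular array needs.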

The Google Chrome team also introduced NaCl (Native Client) last year, which is a reasonably interesting proposition from a performance standpoint, since it allows you to replace some of your Javascript with native code. It seems like you should be able to implement an OpenCL to NaCl compiler, which could be very interesting. Unfortunately, since it uses binaries instead of source code, it is very hard to inspect the scripts, unlike in Javascript.

The two other browser vendors, WebKit (Apple and friends) and Opera, recently shipped new browser engines that support all the engine-level features that we currently expect, and they are very likely to stay competitive in the future.

But there are a lot of other features that are in the planning stage. A pet feature of mine, for example, is SIMD intrinsics. SIMD (Single-Instruction Multiple-Data) is a method for improving computational throughput in modern processors by performing the same operation in parallel on multiple pieces of data.

These SIMD intrinsics are very simple functions that essentially map down to a few simple SIMD assembly instructions. The Javascript engine would be aware of how these functions work, and generate special optimizations for them.

This is mainly an optimization, but it would also allow us to write more easily readable code when manipulating ‘strange types’ in Javascript, for example, long (64-bit) integers.

While there are currently not even any proposals of how these SIMD intrinsics should behave, there is still a high probability that we will see something along those lines in a future revision of the Javascript language.

This leads us to more complex parallelization features that introduce more than one thread of execution.

Multi-threaded

Workers are currently the only way of executing Javascript in parallel that is widely supported in current browsers, but they are not really designed for the task. They are designed to allow for background tasks, but are not really suitable for computation on their own.

But make sure that you do not forget about them, because they are a good fallback and can be a force multiplier when combined with more advanced features.

River Trail utilizes OpenCL to execute Javascript kernels on a multi-core CPU using a friendly API. I am quite convinced that it will be a popular choice in the future.

The most compelling feature of River Trail is that it is tightly integrated with the browser, and therefore allows for a lot more optimization than WebCL (or OpenCL) allows. Don’t be surprised if a future River Trail implementation outpaces WebCL significantly on short kernels where OpenCL imposes too high a communication overhead.

Another interesting thing is that River Trail can be combined with a lot of other performance increasing features in Javascript, for example SIMD intrinsics, which (like the SIMD features in OpenCL) could significantly increase performance and readability for certain kernels.

WebCL is essentially the big brother of River Trail, exposing the full OpenCL API to the web programmer, and allows you to use unmodified OpenCL kernels in your application. It is essentially the more flexible version of River Trail, and is designed to allow you to use any OpenCL accelerator in the system, including graphics processors and so on.

WebCL is also the API that we will be using the most throughout this series, mainly since the compilers are mature, and it is also the language and framework that I am most familiar with.

But River Trail has some interesting opportunities that we won’t see in WebCL, since it could be more tightly integrated into the browser at a future point. For example, an implementation of River Trail could significantly reduce the communication overhead required to run kernels on the CPU, which is currently quite significant in OpenCL.

WebCL currently has the advantage that the infrastructure is a bit more mature on the kernel side; on the Javascript side both technologies are noticeably not ready.

WebGL on the other hand has the advantage that it is reasonably mature, and allows execution on just about every graphics card available.

But I generally wouldn’t recommend using it for computation unless you have very specific requirements, since even the simplest tasks can easily turn extremely complex unless you’re very good at GLSL and WebGL. It is simply not designed for computation, only graphics.

Conclusion

There are many tools and frameworks already available in a pre-release form for us to play with, and the best way to get used to them is to actually use them. Just be aware of their changing and non-final nature, and use this time to your advantage: most developers won’t start using these tools until they are almost ready, and by then it is too late to influence their growth.

In addition to tools that ‘merely’ grant us faster performance, we have a lot of tools that simply allow us to do a lot of things that we could not do before, but those are interesting enough to get their own introductions when we meet them later in the series.

The next episode will be a shorter one, and contain instructions on how to set up our development environment on Windows or Linux. For example, installing different OpenCL drivers, WebCL plugins and River Trail.

Notes

There are currently at least three different implementations of WebCL,

A wishlist for ES6

First, I am not a web designer, and I realize that my needs are not the same as everyone else’s. But I am going to argue that there are only a few features that need to be added to Javascript in the next version (ES6 or Harmony).

  1. Typed Arrays
  2. An improved numerical library
  3. Continuations

And you will probably have your own list, but if you don’t agree with at least the first two, then you are probably wrong. They are really important, both in the browser and outside. The third is more of a personal preference, but I think it is probably the one change that would improve the language the most.

Typed Arrays

Khronos’ typed arrays are essentially just an optimization. They work just like arrays, but you can only store a specific type, and their size is specified when the object is created.

This gives a bit of extra performance for many kinds of applications. And in the future we will probably write a lot more applications that manipulate binary data in Javascript. In addition, the interface is pretty good and provides sugar for different sorts of type-conversion. It does not introduce any new syntax and most browsers support them to a certain degree already, so backwards compatibility isn’t a problem.

The only thing missing in the current specification is a way to determine the native endianness of the machine that the code is executing on, and you therefore need a small method to do it for you.
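One common way to write such a method is to store a known 16-bit value and look at which byte comes first in memory. The following is a minimal sketch; the name `isLittleEndian` is an assumption made for illustration and not part of any specification.

```javascript
// Returns true when the machine stores the least significant byte first.
// `isLittleEndian` is a hypothetical helper name.
function isLittleEndian() {
  const probe = new Uint16Array([0x0102]);
  const probeBytes = new Uint8Array(probe.buffer);
  // On a little-endian machine the low byte (0x02) is at offset 0.
  return probeBytes[0] === 0x02;
}

console.log(isLittleEndian());
```

The result can be computed once at startup and then used to pick the byte order when reading file formats that specify their own endianness.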

Float64Arrays are not implemented in Safari, and DataViews are not implemented in Safari, Firefox or Opera. Typed Arrays are not supported at all in IE before version 10.

Numerical Library

The numerical library in Javascript (the Math module) is not only exceptionally sparse, it isn’t defined what it should do either. Math.sin(x) could return 4 and still follow the specification.

This makes coding to the specification impossible, and useless, since only the name of the function and a few edge cases are defined. Four is a valid approximation of the value of sine. sin(x) = cos(x) - 1 is another approximation of sine that is legal but not very good, as is sin(x) = 1 - e^x, which quickly becomes an extremely bad approximation.

The proposal for ES6 isn’t any better, since it does not include specifications either. The correct solution to the problem is probably that all operations should be correctly rounded (as in the IEEE 754-2008 specification), but OpenCL (see Chapter 7) provides relative error bounds, and that would be an acceptable solution to the specification problem as well.
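Relative error bounds of this kind are usually stated in ulps (units in the last place). As a rough sketch of how such an error could be measured from Javascript, the helper below counts how many representable doubles lie between two values. The name `ulpDistance` is an assumption made for illustration, the helper relies on BigInt64Array (which only exists in modern engines), and it is only meaningful for finite values of the same sign.

```javascript
// Number of representable doubles between a and b.
// Hypothetical helper; valid for finite a and b with the same sign.
function ulpDistance(a, b) {
  const values = new Float64Array([a, b]);
  // Reinterpret the same bytes as 64-bit integers. For doubles of the
  // same sign, the integer ordering matches the floating-point ordering.
  const bits = new BigInt64Array(values.buffer);
  const difference = bits[0] - bits[1];
  return difference < 0n ? -difference : difference;
}

console.log(ulpDistance(1, 1 + Number.EPSILON)); // 1n
```

A specification with error bounds would then amount to a promise that this distance between the returned value and the exact result never exceeds some small constant per function.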

There are also some significant functions in IEEE 754-2008 that are not included in the current ES6 proposal. The most significant is the Fused Multiply-Add, which is quite slow to emulate in software.

OpenCL also provides a lot of useful functions that may be interesting to implement, especially for games and other applications that require geometry and colour operations. And if the new correctly rounded functions are too slow for embedded devices, or for specific applications, then adding native, fast, half (or single) versions of some of the functions could make sense. (see the OpenCL specification again)

This change would not really affect any applications, if anything, it would make the output of a Javascript application using the Math module a lot more predictable. It also does not introduce any new syntax, and should be easy to add.

Continuations

Adding continuations is more of a personal thing. I find them extremely useful in other languages, and I would find them extremely useful for handling the whole callback-inception that you usually get stuck in while coding Javascript.

It is a more controversial extension to the language since it actually involves syntax, but on the other hand, a reasonable Python-style implementation is already available in Spidermonkey (under the name generators), so any syntax changes are arguably already there.
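To show how generators help with nested callbacks, here is a minimal sketch written with the `function*` and `yield` syntax that ES6 eventually standardized, which differs slightly from the Spidermonkey generators of the time. The names `fetchNumber`, `run` and `addTwoNumbers` are hypothetical, and error handling is left out to keep the idea visible.

```javascript
// Hypothetical stand-in for an asynchronous operation. It returns a
// function that delivers its result through a callback; a real one
// would call back later, for example after a network request.
function fetchNumber(value) {
  return function (callback) { callback(value); };
}

// Drives a generator: each yielded function is called with a callback
// that resumes the generator with the delivered result.
function run(generatorFunction, done) {
  const iterator = generatorFunction();
  function step(value) {
    const result = iterator.next(value);
    if (result.done) {
      done(result.value);
    } else {
      result.value(step);
    }
  }
  step(undefined);
}

// Reads top to bottom, with no nested callbacks.
function* addTwoNumbers() {
  const a = yield fetchNumber(1);
  const b = yield fetchNumber(2);
  return a + b;
}

run(addTwoNumbers, function (sum) { console.log(sum); }); // prints 3
```

The generator is suspended at each `yield` and resumed when the callback fires, which is the part of continuations that matters most for this use case.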

Summary

I don’t really think that we need a lot of the syntax that people are proposing, and while some of it is reasonable (blocks, for example), I do not feel that we really need it.

The same goes for classes and so on. We should be relatively restrictive about adding syntax, since it restricts our ability to extend syntax in the future. On the other hand, we can be relatively liberal when adding library functions that can be deprecated in later versions of the standard.