WebAssembly SIMD

ICP supports deterministic WebAssembly SIMD. This is a significant milestone for smart contracts that demand top onchain performance, such as artificial intelligence (AI), image processing (NFTs), games, scientific decentralized applications (dapps), and more.

A significant performance boost is also possible for "classical" blockchain operations implemented in canisters. For example, reward distribution and cryptographic operations might benefit from the new WebAssembly SIMD instructions.

What is WebAssembly SIMD?

WebAssembly SIMD (single instruction, multiple data) is a set of more than 200 deterministic vector instructions defined in the WebAssembly core specification. Each instruction operates on multiple data elements in parallel, which significantly accelerates certain tasks within canisters running on ICP.

The SIMD functionality is available on every ICP node.

Developer benefits

WebAssembly SIMD support enables a new level of performance on ICP. Developers can:

  • Optimize code for computationally heavy tasks: Identify areas within their canisters that can benefit from SIMD instructions and tailor their code for accelerated performance.
  • Unlock new possibilities: Explore novel functionalities and complex applications that were previously limited by processing power.
  • Build a future-proof foundation: Position themselves at the forefront of blockchain innovation.

Using WebAssembly SIMD

There are two main ways to benefit from WebAssembly SIMD in a smart contract:

  1. Loop auto-vectorization: Simply enabling WebAssembly SIMD and recompiling the project might be enough to get a significant performance boost. This is usually a simple, low-risk, one-line change, and it is often the recommended first step, but the result depends heavily on the algorithms, libraries, and compilers used.

  2. SIMD intrinsics: Computation-heavy functions may be rewritten using direct SIMD instructions. This exposes the full SIMD potential, but in many cases core canister algorithms must be completely rewritten using the new instructions.

Using loop auto-vectorization

To leverage loop auto-vectorization, WebAssembly SIMD instructions must be enabled either globally for the entire workspace or locally for specific functions within the canister. Once the instructions are available, the compiler automatically converts some ordinary loops into loops with parallel computations.

While the change is easy and low-risk, the result in practice depends on many factors: the algorithm itself, the compiler optimization level and options, project dependencies, etc.

Example

To enable WebAssembly SIMD instructions globally for the whole workspace and all its dependencies:

Create the `.cargo/config.toml` file with the following content:

[build]
target = ["wasm32-unknown-unknown"]

[target.wasm32-unknown-unknown]
rustflags = ["-C", "target-feature=+simd128"]
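
Alternatively, assuming a standard Cargo setup, the same feature can be enabled for a single build via the `RUSTFLAGS` environment variable instead of `.cargo/config.toml` (a sketch of the "one-line change" mentioned above):

RUSTFLAGS="-C target-feature=+simd128" cargo build --release --target wasm32-unknown-unknown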

To enable WebAssembly SIMD instructions just for a specific function within a canister:

#[target_feature(enable = "simd128")]
fn auto_vectorization() {
...
}
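
For illustration, this is the kind of loop body auto-vectorization targets. The sketch below is plain, portable Rust and the function names are hypothetical; with `simd128` enabled and optimizations on, the compiler can typically turn the element-wise loop into `f32x4` vector operations on its own:

```rust
// Element-wise addition over slices: a classic auto-vectorization
// candidate. With `target-feature=+simd128` and `--release`, the
// compiler may emit SIMD instructions for this loop automatically.
fn add_slices(a: &[f32], b: &[f32], out: &mut [f32]) {
    for ((o, &x), &y) in out.iter_mut().zip(a).zip(b) {
        *o = x + y;
    }
}

fn main() {
    let a = [1.0_f32, 2.0, 3.0, 4.0];
    let b = [10.0_f32, 20.0, 30.0, 40.0];
    let mut out = [0.0_f32; 4];
    add_slices(&a, &b, &mut out);
    println!("{:?}", out);
}
```

No source changes are needed for this approach; only the build configuration determines whether the loop is vectorized.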

WebAssembly SIMD instructions may be enabled by default in future dfx versions, in which case enabling them for a specific function within a canister would be redundant.

Using WebAssembly SIMD intrinsics

WebAssembly SIMD instructions are available as platform-specific intrinsics for the wasm32 platform. To use the intrinsics, WebAssembly SIMD instructions must be enabled as described in the previous section.

Example

Here's a short code snippet demonstrating how to multiply two arrays of four `f32` elements each using a single SIMD instruction:

#[inline(always)]
#[target_feature(enable = "simd128")]
pub fn mul4(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
    use core::arch::wasm32::*;

    // Load the arrays `a` and `b` into SIMD registers.
    let a = unsafe { v128_load(a.as_ptr() as *const v128) };
    let b = unsafe { v128_load(b.as_ptr() as *const v128) };

    // Multiply the elements of `a` and `b` using a single SIMD instruction.
    let c = f32x4_mul(a, b);

    // Store and return the result.
    let mut res = [0.0; 4];
    unsafe { v128_store(res.as_mut_ptr() as *mut v128, c) };
    res
}
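
The intrinsics above compile only for the wasm32 target. When testing the same logic off-chain on a native host, a portable scalar equivalent is useful as a reference; the helper name below is hypothetical:

```rust
// Portable scalar reference for the SIMD `mul4` above: multiplies the
// four element pairs one at a time instead of with a single instruction.
pub fn mul4_scalar(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
    [a[0] * b[0], a[1] * b[1], a[2] * b[2], a[3] * b[3]]
}

fn main() {
    let c = mul4_scalar([1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]);
    println!("{:?}", c); // [5.0, 12.0, 21.0, 32.0]
}
```

Comparing the scalar and SIMD versions on the same inputs is a simple way to check that a rewrite using intrinsics did not change the results.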

Frequently asked questions

How can I measure a canister's performance speedup?

ICP provides the `ic0.performance_counter` system API call to measure a canister's performance.

There is also the `canbench` benchmarking framework.

Are there any libraries for artificial intelligence (AI) inferences?

Sonos `tract` is a tiny, self-contained TensorFlow and ONNX inference library written in Rust. DFINITY contributed WebAssembly SIMD support to the library, and it is used in some DFINITY AI demos and examples.

References and examples