5× faster fast_blur in image-rs
A few weeks ago, I went looking for something to optimize, for fun. The Rust image crate jumped out: I'd used it before, it's one of the most downloaded crates, and image work is heavy by nature. I quickly found a method called fast_blur. With a name like that, it seemed worth trying to optimize.
The result: up to 5.9x faster on images with u8 pixels.
Before we get into the optimizations, let's take a step back and understand how blurs work, the tradeoffs between different algorithms, and where fast_blur fits in the spectrum of blur quality and performance.
Blur performance and quality
Gaussian Blur
Probably the most common blur and what people usually picture when they say "blur".
It replaces each pixel with a weighted average of its neighbors, where σ controls how far the influence reaches. In practice, since pixels are discrete, this translates to a k×k convolution kernel where k, the kernel size, scales directly with σ.
Because of the convolution, a naive 2D gaussian blur costs O(k²) per pixel.
But the gaussian blur can be separated into two 1D passes, effectively bringing the complexity down to O(k) per pixel.
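The separation works because the 2D gaussian factors into the product of two 1D gaussians along x and y. A quick standalone numeric check of that identity (sigma and the sample range are arbitrary, not anything from image-rs):

```rust
// Unnormalized 1D and 2D gaussian values.
fn gaussian_1d(x: f64, sigma: f64) -> f64 {
    (-(x * x) / (2.0 * sigma * sigma)).exp()
}

fn gaussian_2d(x: f64, y: f64, sigma: f64) -> f64 {
    (-(x * x + y * y) / (2.0 * sigma * sigma)).exp()
}

fn main() {
    // The 2D kernel is the outer product of the 1D kernel with itself,
    // which is exactly what makes the two-pass separation valid.
    let sigma = 1.5;
    for x in -3..=3 {
        for y in -3..=3 {
            let (x, y) = (x as f64, y as f64);
            let product = gaussian_1d(x, sigma) * gaussian_1d(y, sigma);
            assert!((gaussian_2d(x, y, sigma) - product).abs() < 1e-12);
        }
    }
}
```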
This is what the blur from image-rs implements, but there are ways to blur much faster if we can sacrifice some quality.
Box Blur
This one is much simpler: each pixel is replaced by the average of its neighbors within a fixed-size window. Essentially, a convolution with a k×k kernel of uniform weights.
This effectively blurs the image but the lack of smoothing in the kernel produces a harsher, less natural-looking blur than a gaussian:
Like the gaussian blur, the box blur can be separated into two 1D passes. And since all weights are equal to 1/k (for a box blur of size k), we can factor the division out: accumulate all the values, then divide once at the end.
We can go further: each pass can use a sliding window instead of recomputing the average from scratch. In the end, for each pixel we end up with around 2 additions per pass (vertical and horizontal) and 1 division to average the sum, making this blur O(1) per pixel regardless of kernel size.
So we went from O(k²) to O(1) per pixel – much faster. The catch: the result is blockier and less natural than the gaussian.
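The sliding-window idea can be sketched in a few lines. This is an illustrative standalone pass, not image-rs's actual internals: edges are handled by clamping the sample index, and each output costs O(1) regardless of radius.

```rust
// One 1D sliding-window box blur pass over a row of u8 samples.
fn box_blur_1d(src: &[u8], radius: usize) -> Vec<u8> {
    let n = src.len() as isize;
    let r = radius as isize;
    let kernel_size = 2 * radius as u32 + 1;
    // Clamp out-of-range indices to the edge pixels.
    let sample = |i: isize| src[i.clamp(0, n - 1) as usize] as u32;

    // Prime the window for output index 0.
    let mut sum: u32 = (-r..=r).map(sample).sum();
    let mut dst = vec![0u8; src.len()];
    for i in 0..n {
        dst[i as usize] = ((sum + kernel_size / 2) / kernel_size) as u8;
        // Slide right: add the incoming sample, drop the outgoing one.
        sum += sample(i + r + 1);
        sum -= sample(i - r);
    }
    dst
}

fn main() {
    // A single bright pixel spreads evenly over the 3-wide window.
    assert_eq!(box_blur_1d(&[0, 0, 255, 0, 0], 1), vec![0, 85, 85, 85, 0]);
}
```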
Fast Almost-Gaussian Filtering aka fast_blur
This is a way to get the best of both worlds, as long as exact accuracy is not a hard requirement. The idea¹ is that applying 3 successive box blurs with carefully chosen kernel sizes produces a result that closely approximates a true gaussian blur.
Each additional box blur pass smooths the kernel further toward a gaussian:
The good news is that, effectively, we can get a much better-looking blur while using only box blurs.
Since this only uses box blurs, we now have a decent-looking blur at O(1) cost per pixel regardless of sigma, and this is exactly what fast_blur from image-rs is.
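You can see this smoothing numerically: convolving a uniform box kernel with itself once gives a triangle, and a third pass already looks bell-shaped. A standalone sketch, unrelated to the crate's code:

```rust
// Discrete 1D convolution of two kernels.
fn convolve(a: &[f64], b: &[f64]) -> Vec<f64> {
    let mut out = vec![0.0; a.len() + b.len() - 1];
    for (i, &x) in a.iter().enumerate() {
        for (j, &y) in b.iter().enumerate() {
            out[i + j] += x * y;
        }
    }
    out
}

fn main() {
    // One box pass: flat kernel. Two passes: triangle. Three passes:
    // already bell-shaped — the "almost gaussian" effect.
    let one = vec![1.0 / 3.0; 3];
    let two = convolve(&one, &one);   // [1, 2, 3, 2, 1] / 9
    let three = convolve(&two, &one); // [1, 3, 6, 7, 6, 3, 1] / 27
    assert!((two[2] - 3.0 / 9.0).abs() < 1e-12);
    assert!((three[3] - 7.0 / 27.0).abs() < 1e-12);
    // Each kernel still sums to 1, so overall brightness is preserved.
    assert!((three.iter().sum::<f64>() - 1.0).abs() < 1e-12);
}
```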
The hot path
Now that we have a bit more context on the algorithm, its complexity and its quality, we can start looking at the code and try to make it faster. First step is to profile it and see where the time is going. I used the profiling tools we're building at CodSpeed, along with the MCP integration and an agent to quickly identify the hot spots.
The fast_blur flame graph for a u8 image is dominated by three costs:
- roundf: 27% of total time
- to_f32: 22% of total time
- min/max: 20% of total time
The hot path converts every u8 subpixel to f32, accumulates with float arithmetic, then converts back with rounding_saturating_mul, which internally calls the C library's roundf. For a 1920×1080 RGBA image, that's over 8 million subpixels, each hitting this float round-trip multiple times across the 6 box blur passes (3 horizontal + 3 vertical).
Dropping floats for integers
A u8 pixel has values in 0-255 and a box blur kernel sums at most ~width (or height) pixels.
The maximum number of 255-valued samples a u32 accumulator can hold is u32::MAX / 255, which is about 16.8 million, the equivalent of two entire 4K images. Just for the blur kernel. Realistically, this is more than enough headroom for any blur radius we would ever use.
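The headroom arithmetic is easy to verify (the frame comparison assumes 3840×2160 4K frames):

```rust
fn main() {
    // Worst case: every sample in the window is 255. This is how many
    // such samples a u32 accumulator can take before overflowing.
    let max_samples = u32::MAX / 255;
    assert_eq!(max_samples, 16_843_009); // ~16.8 million
    assert!(max_samples as u64 * 255 <= u32::MAX as u64);

    // More samples than two entire 4K frames (3840 x 2160 pixels).
    assert!(max_samples as u64 > 2 * 3840 * 2160);
}
```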
All the accumulation could be done with integer arithmetic, eliminating float conversions, roundf calls, and min/max induced by rounding_saturating_mul, which was clamping to the u8 range.
Stripped of all the boilerplate, the inner loop goes from this:
```rust
let mut sum: f32 = 0.0;
for i in kernel_range {
    sum += src[i].to_f32().unwrap();
}
*dst = rounding_saturating_mul(sum, 1.0 / kernel_size as f32);
```
to this:
```rust
let mut sum: u32 = 0;
for i in kernel_range {
    sum += src[i] as u32;
}
*dst = ((sum + kernel_size / 2) / kernel_size) as u8;
```
This gives the exact same blur output, but without going through the expensive float operations.
One thing to note is the rounding bias, kernel_size / 2, added before the division. Adding this value before the integer division gives us the same rounding behavior as roundf does for positive floats, at the cost of a single addition per pixel.
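The equivalence is cheap to check exhaustively for a given kernel size. A standalone check (not the crate's test suite), covering every sum a u8 box blur of that size can produce:

```rust
fn main() {
    for kernel_size in [3u32, 4, 7, 16] {
        for sum in 0..=(255 * kernel_size) {
            // Integer rounding with the bias...
            let int_avg = (sum + kernel_size / 2) / kernel_size;
            // ...matches rounding the float quotient, like roundf does.
            let float_avg = (sum as f32 / kernel_size as f32).round() as u32;
            assert_eq!(int_avg, float_avg);
        }
    }
}
```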
Designing the trait
The technique works for u8 pixels, but image supports many other primitive types: u16, f32, even f64. Some users blur HDR float images, others blur 8-bit thumbnails. Forcing integer accumulators on f32 pixels makes no sense; conversely, keeping every path on f32 would block the u8 win we just designed.
So the optimization needs to live alongside the existing float path, and without sprinkling if pixel_type == u8 branches across every blur pass, since that would either defeat the point or make the code really hard to maintain.
Since the shape of the operation is the same in both worlds (start at zero, add samples, divide and store), only the types and the arithmetic differ. After a few iterations with the maintainers, the design landed on a BlurAccumulator<T> trait: parameterized by the source pixel type, with the impl target being the accumulator type. Stripped of error-handling and edge-case methods:
```rust
pub(crate) trait BlurAccumulator<T>: Copy {
    type Weight: Copy;
    const ZERO: Self;

    fn from_primitive(value: T) -> Self;
    fn create_weight(kernel_size: usize) -> Self::Weight;
    fn to_store(self, weight: Self::Weight) -> T;
}
```
With this trait, the inner loop becomes generic over both the pixel type and its accumulator:
```rust
let weight = Acc::create_weight(kernel_size);
let mut sum = Acc::ZERO;
for i in kernel_range {
    sum += Acc::from_primitive(src[i]);
}
*dst = sum.to_store(weight);
```
By putting the trait on the accumulator (and not the source), a u32 knows how to take u8 samples and average them while the pixel type is just one of the inputs.
Two implementations cover everything. u32 for the u8 fast path:
```rust
impl BlurAccumulator<u8> for u32 {
    type Weight = u32; // kernel size
    const ZERO: u32 = 0;

    fn from_primitive(v: u8) -> u32 { v as u32 }
    fn create_weight(ks: usize) -> u32 { ks as u32 }
    fn to_store(self, ks: u32) -> u8 {
        ((self + ks / 2) / ks) as u8
    }
}
```
f32 for everything else, preserving the original float path verbatim:
```rust
impl<T: Primitive> BlurAccumulator<T> for f32 {
    type Weight = f32;
    const ZERO: f32 = 0.0;

    fn from_primitive(v: T) -> f32 { v.to_f32().unwrap() }
    fn create_weight(ks: usize) -> f32 { 1.0 / ks as f32 }
    fn to_store(self, w: f32) -> T { rounding_saturating_mul(self, w) }
}
```
A small companion trait on the source primitive picks the right accumulator at the type level:
```rust
pub trait WithBlurAcc: Sized {
    type BlurAcc: BlurAccumulator<Self>;
}

impl WithBlurAcc for u8 { type BlurAcc = u32; }
impl WithBlurAcc for u16 { type BlurAcc = f32; }
// ... f32, f64, i8, i16, ... all map to f32
```
The blur pass becomes generic over both types, and the right impl resolves entirely at compile time. The compiler picks u32 for u8 paths and f32 for everything else, and inlines accordingly.
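As a self-contained illustration of the whole design (names mirror the post, but this is a simplified sketch, not the crate's actual code; the AddAssign bound stands in for the arithmetic methods stripped from the trait above), a naive 1D pass written once against the traits monomorphizes to pure integer arithmetic for u8:

```rust
use std::ops::AddAssign;

trait BlurAccumulator<T>: Copy + AddAssign {
    type Weight: Copy;
    const ZERO: Self;
    fn from_primitive(value: T) -> Self;
    fn create_weight(kernel_size: usize) -> Self::Weight;
    fn to_store(self, weight: Self::Weight) -> T;
}

trait WithBlurAcc: Sized + Copy {
    type BlurAcc: BlurAccumulator<Self>;
}

impl BlurAccumulator<u8> for u32 {
    type Weight = u32;
    const ZERO: u32 = 0;
    fn from_primitive(v: u8) -> u32 { v as u32 }
    fn create_weight(ks: usize) -> u32 { ks as u32 }
    fn to_store(self, ks: u32) -> u8 { ((self + ks / 2) / ks) as u8 }
}

impl WithBlurAcc for u8 { type BlurAcc = u32; }

// Naive windowed 1D pass, generic over the pixel type; the compiler
// resolves the accumulator at compile time (u32 here for u8 pixels).
// Assumes src.len() >= kernel_size; edge handling is omitted.
fn blur_pass_1d<T: WithBlurAcc>(src: &[T], kernel_size: usize) -> Vec<T> {
    let weight = T::BlurAcc::create_weight(kernel_size);
    (0..=src.len() - kernel_size)
        .map(|i| {
            let mut sum = T::BlurAcc::ZERO;
            for &s in &src[i..i + kernel_size] {
                sum += T::BlurAcc::from_primitive(s);
            }
            sum.to_store(weight)
        })
        .collect()
}

fn main() {
    assert_eq!(blur_pass_1d(&[0u8, 255, 0, 255, 0], 3), vec![85, 170, 85]);
}
```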
In practice, since the u32 accumulator might overflow when the kernel covers more than ~16.8 million samples, there is still one runtime check to fall back to f32 accumulation when the radius is huge. But normal use cases are well within the u32 limit, so this is a non-issue.
Impact
Dropping the floats alone gives a ~×1.83 wall-clock speedup across all fast_blur benchmarks, with no quality regression on the test suite:
Avoiding divisions
The integer accumulator optimization replaced the float arithmetic but left the integer division (sum + kernel_size / 2) / kernel_size in place. It eliminated roundf and to_f32, but div, the division instruction, was still a significant hot spot.
Even if we're talking about a single assembly instruction, integer division is one of the most expensive arithmetic operations on modern CPUs. A single div clogs the pipeline for 20–30 cycles², and unlike most arithmetic it can't be pipelined, meaning the CPU stalls until it completes. For a 1920×1080 RGBA image with 3 box blurs (6 1D passes), that's over 50 million divisions with a stalled pipeline.
But with kernel_size constant for all pixels within a single blur pass, it should be possible to precompute part of this.
The technique
In real arithmetic, multiplying numerator and denominator by 2^32 changes nothing:

n / d = (n * (2^32 / d)) / 2^32

This would be useful only if 2^32 / d ends up as a whole number, which it rarely does. However, Granlund & Montgomery (1994)³ show that using the ceiling of the reciprocal instead of the exact reciprocal gives an approximation that is guaranteed to round down to the correct integer quotient:

floor(n / d) = floor((n * ceil(2^32 / d)) / 2^32)

And this is exact as long as n * d < 2^32, which holds under the u32 overflow bound on kernel_size we already had for the integer accumulator.
The reciprocal ceil(2^32 / kernel_size) can be precomputed once per blur pass, and the final division by 2^32 becomes a single right shift on the 64-bit product: >> 32, one of the simplest arithmetic instructions there is, and one that pipelines extremely well.
Stripped down, this:
```rust
*dst = ((sum + kernel_size / 2) / kernel_size) as u8;
```
Becomes:
```rust
*dst = (((sum + kernel_size / 2) as u64 * reciprocal) >> 32) as u8;
```
after defining reciprocal once per blur pass:

```rust
let reciprocal = (u32::MAX / kernel_size) + 1;
```
This replaces the expensive div with a 64-bit multiplication and a bit-shift, which on modern CPUs is around 3–4 cycles.
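Within the blur's operating range, the substitution can be checked exhaustively for small kernel sizes. A standalone sanity check (not the crate's tests):

```rust
fn main() {
    for kernel_size in 1u32..=64 {
        // Precomputed once per blur pass: ceil(2^32 / kernel_size).
        let reciprocal = (u32::MAX / kernel_size) as u64 + 1;
        // Every sum-plus-bias value a u8 box blur of this size can hit.
        for n in 0..=(255 * kernel_size + kernel_size / 2) {
            let fast = ((n as u64 * reciprocal) >> 32) as u32;
            assert_eq!(fast, n / kernel_size);
        }
    }
}
```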
Implementation
To make this clean inside the trait, the Weight for the u32 accumulator carries both the reciprocal and the rounding bias:
```rust
#[derive(Clone, Copy)]
struct U8Weight {
    reciprocal: u32,    // ceil(2^32 / kernel_size)
    rounding_bias: u32, // kernel_size / 2
}
```
And the BlurAccumulator<u8> impl on u32 becomes:
```rust
impl BlurAccumulator<u8> for u32 {
    type Weight = U8Weight;
    // ...

    fn create_weight(kernel_size: usize) -> U8Weight {
        let ks = kernel_size as u32;
        U8Weight {
            // ceil(2^32 / ks) = floor((2^32 - 1) / ks) + 1 = (u32::MAX / ks) + 1
            reciprocal: (u32::MAX / ks) + 1,
            rounding_bias: ks / 2,
        }
    }

    fn to_store(self, w: U8Weight) -> u8 {
        let n = self + w.rounding_bias;
        ((n as u64 * w.reciprocal as u64) >> 32) as u8
    }
}
```
Impact
The reciprocal substitution alone gives a ×3 speedup on top of the integer accumulators across all kernel sizes:
This confirms that the division was the dominant cost in the inner loop after dropping floats, and that replacing it with a reciprocal multiplication is a huge win.
Where this lands
Overall, the combination of integer accumulators and reciprocal multiplication gives a 5.9× speedup on u8 images, from ~52ms to ~8ms. That's 120 frames per second instead of 19, fast enough for real-time applications like video streaming or games.
Both wins boil down to one observation: not all instructions are equal. Float conversions and libm calls like roundf cost far more than plain integer additions, and div costs roughly an order of magnitude more than a mul and a shift. One thing is sure: CPU pipelines are fascinating, and I'm not done digging into them.
The full implementation was merged and will be available in the next release of image-rs. Thanks to @awxkee for the arithmetic expertise, and to @197g and @RunDevelopment for design feedback.
Footnotes
1. Kovesi, P.: Fast Almost-Gaussian Filtering. The Australian Pattern Recognition Society Conference: DICTA 2010. December 2010. Sydney. PDF
2. Fog, A.: Instruction tables: Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel, AMD, and VIA CPUs. PDF
3. Based on Theorem 4.2 from Granlund, T., Montgomery, P.L.: Division by Invariant Integers using Multiplication. ACM SIGPLAN Notices, 29(6), 1994. PDF