Skip to content

Improve RGB to luma conversion logic#2988

Open
RunDevelopment wants to merge 3 commits into
image-rs:mainfrom
RunDevelopment:opt-rgb-to-luma
Open

Improve RGB to luma conversion logic#2988
RunDevelopment wants to merge 3 commits into
image-rs:mainfrom
RunDevelopment:opt-rgb-to-luma

Conversation

@RunDevelopment
Copy link
Copy Markdown
Member

Changes:

  • Move RGB -> Luma conversion into sealed trait.
  • Change u8 behavior. Luma is now correctly rounded.
  • Remove trait bound Rgb<T>: Enlargeable and Rgba<T>: Enlargeable.
  • Expand convert benchmark

My main goal was to make the RGB to luma conversion faster. To that end, I added optimized implementations for u8 and f32. However, this didn't work out. The benchmarks show no change at all. Frankly, I don't understand why. The new implementations are doing purely less work and should be close to optimal.

  • u8 CE: Both the old generic and optimized implementations do not vectorize at all. The compiler also loads each byte individually. The only difference is that the optimized version does everything in 32-bit registers, while the generic version uses 64-bit registers to perform the division with a multiply shift (which in turn leads to the compiler unrolling the optimized version once).
  • f32 CE: The generic version uses f64 to do all calculations while the optimized uses f32. The SIMD for both is bad. The f32 version is actually super easy to vectorize well (just do rgba * [W_R, W_G, W_B, 0] and then sum), but the compiler doesn't do this.

In any case, even without a perf win, removing the Enlargeable bound and correct rounding for u8 should still be a win.

@awxkee
Copy link
Copy Markdown
Contributor

awxkee commented May 25, 2026

u8 CE: Both the old generic and optimized implementations do not vectorize at all. The compiler also loads each byte individually. The only difference is that the optimized version does everything in 32-bit registers, while the generic version uses 64-bit registers to perform the division with a multiply shift (which in turn leads to the compiler unrolling the optimized version once).

This code leads to pshufb which is available only from SSSE3, and actually from my view LLVM is not tuned to recognize that pattern well until AVX2, on AVX2 codegen actually looks normal, might be not perfect, but OK. And since your multiplier is huge it leads to requirement of 64-bit product ( to the compiler perspective ) what makes things even worse. Does that one make sense?

@RunDevelopment
Copy link
Copy Markdown
Member Author

And since your multiplier is huge it leads to requirement of 64-bit product ( to the compiler perspective ) what makes things even worse. Does that one make sense?

OMG, you're right. When I modify your version to use my constants, the compiler emits 64-bit pmuludq. I didn't think that the compiler would be that bad at doing range analysis.

Unfortunately, your constants don't produce exact results. IIRC, the smallest shift for which correct constants exist is 22. Is manual SIMD the only answer here?

@awxkee
Copy link
Copy Markdown
Contributor

awxkee commented May 26, 2026

Unfortunately, your constants don't produce exact results. IIRC, the smallest shift for which correct constants exist is 22. Is manual SIMD the only answer here?

Yeah, for constants that complex I don't think manual work is avoidable. Lowering precision is likely the only option. I guess it should find the "right" constant up to i16::MAX, with an i32 intermediate product somewhere in the range [i16::MAX, u16::MAX] which may not be exact, but the error is actually very small. You can see that this range already worse than (0,i16::MAX) but still reasonable https://godbolt.org/z/MjqhGdvhz

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants