Improve RGB to luma conversion logic#2988
Conversation
This code leads to pshufb which is available only from SSSE3, and actually from my view LLVM is not tuned to recognize that pattern well until AVX2, on AVX2 codegen actually looks normal, might be not perfect, but OK. And since your multiplier is huge it leads to requirement of 64-bit product ( to the compiler perspective ) what makes things even worse. Does that one make sense? |
OMG, you're right. When I modify your version to use my constants, the compiler emits 64-bit Unfortunately, your constants don't produce exact results. IIRC, the smallest shift for which correct constants exist is 22. Is manual SIMD the only answer here? |
Yeah, for constants that complex I don't think manual work is avoidable. Lowering precision is likely the only option. I guess it should find the "right" constant up to i16::MAX, with an i32 intermediate product somewhere in the range [i16::MAX, u16::MAX] which may not be exact, but the error is actually very small. You can see that this range already worse than (0,i16::MAX) but still reasonable https://godbolt.org/z/MjqhGdvhz |
Changes:
Rgb<T>: EnlargeableandRgba<T>: Enlargeable.My main goal was to make the RGB to luma conversion faster. To that end, I added optimized implementations for u8 and f32. However, this didn't work out. The benchmarks show no change at all. Frankly, I don't understand why. The new implementations are doing purely less work and should be close to optimal.
u8CE: Both the old generic and optimized implementations do not vectorize at all. The compiler also loads each byte individually. The only difference is that the optimized version does everything in 32-bit registers, while the generic version uses 64-bit registers to perform the division with a multiply shift (which in turn leads to the compiler unrolling the optimized version once).f32CE: The generic version usesf64to do all calculations while the optimized usesf32. The SIMD for both is bad. The f32 version is actually super easy to vectorize well (just do rgba * [W_R, W_G, W_B, 0] and then sum), but the compiler doesn't do this.In any case, even without a perf win, removing the
Enlargeablebound and correct rounding foru8should still be a win.