Converting RGBA to YUV420p in GLSL/Vulkan
This post is about doing an image conversion on the graphics card itself. This operation is necessary in your graphics pipeline to do video capture, as the native image encoding of your target video codec (e.g. mpeg-4, vp9, …) is unlikely to be rgb. Doing the conversion within your pipeline promises a throughput and latency gain for video encoding compared to running third-party screen capture software. Additionally, it nets you a bit of artistic freedom by putting the exact frame timings and cuts under your direct control.
First, we will go over the basics of color spaces, followed by a (short) intro to compute shaders. I will not cover the details of the Vulkan API; there are better tutorials and guides in other places.
Color spaces
Very succinctly, the human eye typically perceives color through three different kinds of receptors in the retina. The neurological response of each of these can be expressed through a convolution involving the intensities of the incoming wavelengths on one hand and a receptor-specific sensitivity curve on the other hand.
(Figure: response curves of the three cone types. Image: Vanessaezekowitz at en.wikipedia, https://en.wikipedia.org/wiki/File:Cones_SMJ2_E.svg)
The differing responses to a specific light stimulus are referred to as tristimulus values and map to one's understanding of color. Since the perception of color now seems to depend mostly on the relative responses of the three different stimuli (recall that convolution is linear), the CIE xyY color representation stores the relative red (x) and green (y) parts together with the original response (Y) as a measure of luminance. The first two parts together are referred to as chromaticity.
Due to the linearity of the stimuli and the nature of the transform, it follows immediately that the mixture of two light sources appears on the line connecting the two light sources themselves. In particular, color representations form a convex set. Most color representations now choose three points on the plane to form a triangle covering the necessary colors. These three points are called the primary colors and the area covered is called the gamut. The white point refers to the representation of a well-defined light source which is used to normalize the color representation. The conversion from one representation to another is then a linear transformation, expressed as a 3x3 matrix. All color values expressible in a representation lie inside the simplex (triangle) spanned by the locations of the primaries, and components are in the interval [0; 1].
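To make this concrete, here is an illustrative glsl sketch of such a 3x3 transform: the well-known (rounded) matrix mapping linear values in the sRGB/Rec.709 primaries to CIE XYZ under a D65 white point. Keep in mind that mat3 is filled column by column, so we write the rows and transpose:

vec3 xyz_from_linear_rgb(vec3 rgb) {
    // Rounded coefficients of the linear sRGB/Rec.709 (D65) to CIE XYZ map.
    const mat3 m = mat3(
        0.4124, 0.3576, 0.1805,
        0.2126, 0.7152, 0.0722,
        0.0193, 0.1192, 0.9505);
    return transpose(m) * rgb;
}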
This model of course permits storing another set of values to represent the same color values. The YUV model does not store the values of the primaries but instead stores a luminance value and scaled differences between the luminance and the blue and red primaries, normalizing the color space values against the differences in human perception. A particular distance should correspond to roughly the same perceived difference across the complete space, which is not true in the case of CIE xyY. The main usability improvement appears when one also considers an analog or digital encoding of the color data; the latter was originally referred to as YCbCr, but by now the terms are sometimes mixed, or all encodings are collectively named YUV.
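As a sketch of this structure in glsl, with hypothetical parameters kr and kb standing for the luma weights of the red and blue primaries (the concrete Rec.709 numbers appear later):

// Luma as a weighted sum of the primaries, chroma as scaled differences
// from luma. The scale factors map cb and cr into [-0.5; 0.5].
vec3 ycbcr_from_weights(vec3 rgb, float kr, float kb) {
    float kg = 1.0 - kr - kb;
    float y  = kr*rgb.r + kg*rgb.g + kb*rgb.b;
    float cb = (rgb.b - y) / (2.0*(1.0 - kb));
    float cr = (rgb.r - y) / (2.0*(1.0 - kr));
    return vec3(y, cb, cr);
}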
Normalizing values in this way ensures that format specific lossy encodings of images or video preserve perception. To further reduce quantization errors a transfer function is introduced. This non-linear but monotonic transformation compresses values in the high end of the linear range and is applied to each component individually, before the linear RGB to YCbCr mapping. This opto-electronic transfer function, as the name implies, maps light intensities to voltage signals, which was of particular importance for the analog parts of the equipment; the exponent used is called gamma. Since modern transfer functions are not simple power functions, the gamma is either an approximation over the complete curve or refers to the power section only. We will see two different transfer functions later, when we discuss the details of sRGB and Rec.709.
A short note on the effects of getting this wrong: if an image displayed on your monitor does not show any differentiable colors in the blacks, it might be that some part of the display process is interpreting linear RGB as if it were sRGB, consequently applying the decoding step an additional time. That loses all the differences in the low end of the color spectrum, especially if the image is quantized to 8-bit sample values. For a gamma of 2.2, it already loses half the available bit depth in the process. As a simple test, a 10% (linear) gray should still be easily differentiable from black, but would be compressed to roughly 0.0063, which is quantized to 2 in an 8-bit system. That means you lose almost all of the color information between that gray and black.
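As a glsl sketch of that arithmetic, with a pure power function of gamma 2.2 standing in for the full transfer function:

// A 10% linear gray, wrongly "decoded" once more as if it were gamma-encoded.
float wrongly_decoded_gray() {
    float gray  = 0.1;            // 10% linear gray
    float wrong = pow(gray, 2.2); // ≈ 0.0063
    return round(wrong * 255.0);  // ≈ 2, the resulting 8-bit code value
}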
Compute shader setup
A compute shader behaves very much like a single-stage graphics shader, except that instead of being executed once per vertex it is executed in workgroups, collections of shader invocations whose number you control when dispatching. The parameters of this call give the number of workgroups as a 3d cuboid. Two special variables, gl_LocalInvocationID and gl_GlobalInvocationID, give the index of a single invocation. Both have type uvec3. The global index is relative to the full cuboid of invocations spanned by the dispatch, while the local index is relative to a smaller cuboid which forms part of the complete invocation block. Since executing a single invocation at a time would be inefficient, the gpu will schedule a number of invocations simultaneously with the same execution flow.
This atomic group is the local invocation cuboid, or workgroup. It will (typically) share control flow, so try to avoid conditional blocks (if and for) that evaluate differently on two different invocations within the same local workgroup. In such cases the different code paths might be taken sequentially, essentially costing you performance for code paths that are not necessary in some invocations. As compensation, invocations in the same workgroup can share results locally and profit from some additional guarantees for atomic operations. The size of the local workgroup is determined by an annotation in glsl, or automatically if none is found. However, we do not need any of this right now; the information is here purely for the sake of completeness. A minimal example is sketched below.
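An illustrative compute stage, fixing the local workgroup size to 8x8x1 via the annotation; the dispatch parameters then give the number of such workgroups:

#version 450
layout(local_size_x = 8, local_size_y = 8, local_size_z = 1) in;

void main() {
    uvec3 global_idx = gl_GlobalInvocationID; // index within the whole dispatch
    uvec3 local_idx  = gl_LocalInvocationID;  // index within this workgroup
}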
The YUV420p format
This is a planar image format, where planar refers to the fact that the memory representation groups the samples by component and not by pixel. It is based on a particular YCbCr color space, in our case BT.709, also known as Rec.709. In its encoding, the Y values are followed by the U and V values. To save additional space, it only stores aggregated U and V samples, one for each two-by-two block in the original image.
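As a sketch, these are the offsets of the three planes within a tightly packed buffer for a w-by-h frame (in samples, which equal bytes at 8-bit depth; any row padding is ignored here):

// Plane offsets in a tightly packed yuv420p buffer: w*h luma samples are
// followed by the two quarter-size chroma planes.
uint offset_y(uint w, uint h) { return 0u; }
uint offset_u(uint w, uint h) { return w*h; }
uint offset_v(uint w, uint h) { return w*h + (w/2u)*(h/2u); }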
Note: The transfer function differs between sRGB and Rec.709, the two color representations processed here. Consequently, we need to linearize any sRGB representation and then re-apply the Rec.709 transfer function to stay truly accurate. Any source that tells you to simply apply a linear transformation to sRGB in order to arrive at the Rec.709 encoding is inaccurate. In a typical cpu-bound encoding process working only on integer data, this inaccuracy might be justified by the extreme savings in processor cycles and by avoiding floating-point representations. Neither issue is of concern in a gpu setting, so we might as well do it correctly.
For the following transformations the color values are assumed to be normalized into the range [0; 1]. Only the quantization transformation depends on the integer representation and bit-depth. The transfer function (luminance -> voltage) for Rec.709 is given by:
⎧ 4.5*L if L < 0.018
V = ⎨
⎩ 1.099*(L**0.45) − 0.099 if L ≥ 0.018
Compare this with sRGB, even though we will later see that it is likely irrelevant for us, as we don't have to apply it manually:
⎧ 12.92*L if L < 0.0031308
V = ⎨
⎩ 1.055*(L**(1/2.4)) − 0.055 if L ≥ 0.0031308
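Should the input not arrive through an ..Srgb format (more on that below), we would have to invert this transfer function manually; a per-component sketch:

// Inverse of the sRGB transfer function (voltage -> linear). The threshold
// 0.04045 is 12.92*0.0031308, i.e. the cutoff of the forward function.
float srgb_linearize(float v) {
    return mix(v/12.92, pow((v + 0.055)/1.055, 2.4), step(0.04045, v));
}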
The multiplication matrix to derive YCbCr is specified as:
⎛Y' ⎞ ⎛ 0.2215 0.7154 0.0721⎞ ⎛R'⎞
⎜Cb'⎟ = ⎜-0.1145 -0.3855 0.5000⎟ · ⎜G'⎟
⎝Cr'⎠ ⎝ 0.5016 -0.4556 -0.0459⎠ ⎝B'⎠
Finally, quantization of the values is performed to arrive at a digital signal. Here n is the target bit-depth:
DY' = int[(219·Y' + 16) ·2**(n-8)]
DCb' = int[(224·Cb' + 128)·2**(n-8)]
DCr' = int[(224·Cr' + 128)·2**(n-8)]
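A direct transcription as a glsl sketch; for n = 8 this maps Y' from [0; 1] to the code values 16 through 235, and the chroma components from [-0.5; 0.5] to 16 through 240:

uvec3 quantize_ycbcr(vec3 ycbcr, uint n) {
    float scale = exp2(float(n) - 8.0);
    // The uvec3 constructor truncates, matching the int[..] notation above.
    return uvec3(
        (219.0*ycbcr.x +  16.0)*scale,
        (224.0*ycbcr.y + 128.0)*scale,
        (224.0*ycbcr.z + 128.0)*scale);
}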
The shader
Each invocation of the shader will work on an 8-by-2 block of the input image. The result is then one complete vec4 of data for each of the u and v components. We further assume that the height is a multiple of four to make the address computation a bit easier. That should make it much more efficient to store the result into the output image, since store and load operations only work on complete pixel values. The input to our shader will be either of format ..Unorm or ..Srgb. In either case, the loaded value will already be in linear format, so we can skip applying the sRGB transfer function. This makes the process straightforward, although there might be room for optimization:
#version 450
layout( binding = 0, rgba8 ) uniform readonly image2D rgb;
layout( binding = 1, rgba8 ) uniform writeonly image2D result;
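// One possible choice of workgroup size: a single invocation per workgroup,
// in which case a w-by-h frame is dispatched as (w/8, h/2, 1) workgroups.
layout( local_size_x = 1, local_size_y = 1, local_size_z = 1 ) in;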
// Note that rec.709 and sRGB have the same primaries.
const mat3 mat_rgb709_to_ycbcr = mat3(
0.2215, 0.7154, 0.0721,
-0.1145, -0.3855, 0.5000,
0.5016, -0.4556, -0.0459
);
float rgb709_unlinear(float s) {
// Rec.709 transfer function as given above; step() selects the power
// section for s >= 0.018.
return mix(4.5*s, 1.099*pow(s, 0.45) - 0.099, step(0.018, s));
}
vec3 unlinearize_rgb709_from_rgb(vec3 color) {
return vec3(
rgb709_unlinear(color.r),
rgb709_unlinear(color.g),
rgb709_unlinear(color.b));
}
vec3 ycbcr_from_rgbp(vec3 color) {
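// mat3 above was filled column by column with our rows, so the transpose
// restores the row-wise math of the matrix given earlier.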
vec3 yuv = transpose(mat_rgb709_to_ycbcr)*color;
vec3 quantized = vec3(
(219.0*yuv.x + 16.0)/256.0,
(224.0*yuv.y + 128.0)/256.0,
(224.0*yuv.z + 128.0)/256.0);
return quantized;
}
vec3 sRGB_to_yuv(vec3 color) {
return ycbcr_from_rgbp(unlinearize_rgb709_from_rgb(color));
}
void main() {
uint result_w = imageSize(rgb).x/4;
uvec2 self_id = gl_GlobalInvocationID.xy;
ivec2 coords = ivec2(self_id.x*8, self_id.y*2);
vec3 yuv [16];
int index_x, index_y;
for(index_y = 0; index_y < 2; index_y += 1) {
for(index_x = 0; index_x < 8; index_x += 1) {
vec4 orig_color = imageLoad(rgb, coords + ivec2(index_x, index_y));
vec3 yuv_color = sRGB_to_yuv(orig_color.rgb);
yuv[index_y*8 + index_x] = yuv_color;
} }
// Store back the y values.
for(index_y = 0; index_y < 2; index_y += 1) {
for(index_x = 0; index_x < 2; index_x += 1) {
int i = index_y*8 + index_x*4;
// The +16/256 black-level offset was already added during quantization.
vec4 yyyy = vec4(yuv[i].x, yuv[i+1].x, yuv[i+2].x, yuv[i+3].x);
imageStore(result, ivec2(2*self_id.x + index_x, 2*self_id.y + index_y), yyyy);
} }
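// Origins of the u and v planes, placed directly below the y plane in the
// output image.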
ivec2 top_right_u = ivec2(0, imageSize(rgb).y);
ivec2 top_right_v = ivec2(0, imageSize(rgb).y + imageSize(rgb).y/4);
float us[4];
float vs[4];
for(index_x = 0; index_x < 4; index_x += 1) {
int i = index_x*2;
vec4 uuuu = vec4(yuv[i].y, yuv[i+1].y, yuv[i+8].y, yuv[i+9].y);
vec4 vvvv = vec4(yuv[i].z, yuv[i+1].z, yuv[i+8].z, yuv[i+9].z);
us[index_x] = dot(uuuu, vec4(1.0))/4.0;
vs[index_x] = dot(vvvv, vec4(1.0))/4.0;
}
// Group u and v output
vec4 ucode = vec4(us[0], us[1], us[2], us[3]);
vec4 vcode = vec4(vs[0], vs[1], vs[2], vs[3]);
uint uv_sample_count = self_id.x + self_id.y*(imageSize(rgb).x/8);
ivec2 relative = ivec2(uv_sample_count%result_w, uv_sample_count/result_w);
imageStore(result, top_right_u + relative, ucode);
imageStore(result, top_right_v + relative, vcode);
}
(Free from third-party copyright, triple licensed under Unlicense, CC-0, and WTFPL, whatever suits your needs between corporate compliance, open source zealotry and software hacking. If I did not update the links to actual copies of the licenses, it was my fullest intent to do so and fill them with the missing information you could gather from the meta data on this page as well.)