Converting RGBA to YUV420p in GLSL/Vulkan

This post is about doing an image conversion on the graphics card itself. This operation is necessary in your graphics pipeline if you want to capture video, since the native image encoding of your target video codec (e.g. MPEG-4, VP9, …) is unlikely to be RGB. Doing the conversion within your own pipeline promises a throughput and latency gain for video encoding compared to running third-party screen capture software. Additionally, it nets you a bit of artistic freedom by putting the exact frame timings and cuts under your direct control.

First, we will go over the basics of color spaces, followed by a (short) intro to compute shaders. I will not cover the details of the Vulkan API; there are better tutorials and guides in other places.

Color spaces

Very succinctly, the human eye typically perceives three different colors through three different kinds of receptors in the retina. The neural response of each of these can be expressed as an integral involving the intensities of the incoming wavelengths on one hand and a receptor-specific sensitivity curve on the other hand.
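
Written as a formula, the response of receptor kind i to incoming light with spectral intensity I(λ) is the weighted integral

    R_i = ∫ I(λ)·s_i(λ) dλ

where s_i is the sensitivity curve of that receptor kind (the notation here is mine). Note that R_i is linear in I, a property the following paragraphs rely on.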

A plot of three curves–one red, one green, one blue–illustrating the normalized cone response to different wavelengths of visible light from 400 to 700 nm. Each curve peaks at a specific wavelength and falls off to both sides. The green and red curves overlap to a large degree, while the blue curve sits further apart to their left. Vanessaezekowitz at en.wikipedia, https://en.wikipedia.org/wiki/File:Cones_SMJ2_E.svg

The differences in the responses to specific light intensities are referred to as tristimulus values and map to one's understanding of color. Since the perception of color seems to depend mostly on the relative response of the three different stimuli–recall that integration is linear–the CIE xyY color representation stores the relative red (x) and green (y) parts together with the original response (Y) as a measure of luminance. The first two parts together are referred to as chromaticity.

This alternate description is a side note for those with visual perception difficulties. It might be hard to distinguish the last, upper case, letter in CIE xyY from the preceding, lower case, letter; I'm not an expert in screen readers. The components were presented in the previous paragraph in their order of appearance: the red and green chromaticity parts are identified by lower case letters, while the original stimulus is identified by an upper case letter.
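
For reference, writing the three stimulus responses as X, Y and Z, the chromaticity coordinates are obtained by normalization:

    x = X/(X + Y + Z)
    y = Y/(X + Y + Z)

The normalization is what makes x and y describe only the relative mixture, while Y is carried over unchanged as the luminance.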

2D plot of a set of colors on the xy chromaticity plane of the CIE xyY color space. The outline of the set follows the color response of single wavelengths. The resulting arc goes from the bottom right corner–red at 700 nm–to the top of the diagram–green at 520 nm–then touches the y axis while curving towards the origin–blue at 460 nm. It ends on the x axis at x=0.2. Inside the area enclosed by the curve the different color mixtures are visualized, most notably also showing white at roughly x=y=0.3.

Due to the linearity of the stimuli and the nature of the transform, it follows immediately that the mixture of two light sources appears on the line connecting the two sources. In particular, the set of representable colors is convex. Most color representations choose three points on the plane to form a triangle covering the necessary colors. These three are called the primary colors and the area covered is called the gamut. The white point refers to the representation of a well-defined light source which is used to normalize the color representation. The conversion from one representation to another is then a linear transformation, expressed as a 3x3 matrix. All color values expressible in a representation lie inside the simplex (triangle) spanned by the primaries, with components in the interval [0; 1].
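
As a concrete illustration (a sketch, not part of the shader developed later; the matrix is the commonly published one for the sRGB/Rec.709 primaries with a D65 white point), converting linear RGB to CIE XYZ is one such transformation:

// Linear rgb (sRGB/rec.709 primaries, D65 white point) to CIE XYZ.
// Coefficients in row-major order, transposed on use since glsl
// constructors are column-major.
const mat3 mat_srgb_to_xyz = mat3(
    0.4124, 0.3576, 0.1805,
    0.2126, 0.7152, 0.0722,
    0.0193, 0.1192, 0.9505
);

vec3 xyz_from_linear_rgb(vec3 rgb) {
    return transpose(mat_srgb_to_xyz)*rgb;
}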

This model of course permits storing other sets of values to represent the same colors. The YUV model does not store one value per primary; instead it stores a luminance value and scaled differences between the blue and red components and that luminance. This normalizes the color space against differences in human perception: a given distance should correspond to roughly the same perceived difference across the complete space, which is not true for CIE xyY. The main usability improvement appears once one also considers an analog or digital encoding of the color data; the digital encoding was originally referred to as YCbCr, but by now the terms are sometimes mixed, or all encodings are collectively named YUV.
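
In formulas: with the luma Y' = K_R·R' + K_G·G' + K_B·B' (the weights depend on the chosen primaries), the two chroma components are scaled differences against the blue and red components:

    Cb = (B' − Y') / (2·(1 − K_B))
    Cr = (R' − Y') / (2·(1 − K_R))

The scale factors simply normalize both components into the range [−0.5; 0.5].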

Normalizing values in this way ensures that format specific lossy encodings for images and video preserve perceived differences. To further reduce quantization errors, a transfer function is introduced. This non-linear but monotonic transformation compresses values at the high end of the linear range and is applied to each component individually, before the linear RGB to YCbCr mapping. This opto-electronic transfer function, as the name implies, maps light intensities to voltage signals. It was of particular importance for the analog parts of equipment, and the exponent it uses is called gamma. Since up-to-date functions do not use a simple power function, the gamma is either an approximation over the complete curve or refers to the power section only. We will see two different transfer functions later, when we discuss the details of sRGB and Rec.709.

A short note on the effects of doing something wrong: if an image displayed on your monitor does not have any differentiable colors in the blacks, it might be that some part of the display process is interpreting linear RGB as if it were sRGB, consequently wrongly expanding the high end an additional time. That loses all the differences in the low end of the range, especially if the image is quantized to 8-bit sample values. For a gamma of 2.2, roughly half the available bit depth is already lost in the process. As a simple test, a 10% (linear) gray should still be easily differentiable from black, but it would be compressed to roughly 0.0063, which quantizes to 2 in an 8-bit system. That means you lose almost all of the color information between that level and black.
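
The arithmetic behind those figures, with an 8-bit code being the value scaled by 255 and rounded:

    correct sRGB encode:  1.055·(0.1**(1/2.4)) − 0.055 ≈ 0.349  →  code 89
    extra wrong decode:   0.1**2.2 ≈ 0.0063                     →  code 2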

Compute shader setup

A compute shader behaves very much like a single-stage graphics shader, except that instead of being executed once per vertex it is executed in workgroups, collections of shader invocations whose number you control when dispatching. The dispatch parameters give the number of workgroups as a 3D cuboid; together with the workgroup size this determines the total grid of invocations. Two special variables, gl_LocalInvocationID and gl_GlobalInvocationID, give the index of a single invocation within this grid. Both have type uvec3. The global index is relative to the full cuboid of invocations, while the local index is relative to the smaller cuboid that makes up a single workgroup. Since executing a single invocation at a time would be inefficient, the GPU schedules a number of invocations with the same execution flow simultaneously.

This atomic group is the local invocation cuboid, or workgroup. It will (typically) share control flow, so try to avoid conditional blocks (if and for) that evaluate differently in two different invocations within the same local workgroup. In such cases the different code paths may be taken sequentially, costing you performance for code paths that are not necessary in some invocations. As compensation, invocations in the same workgroup can share results locally and profit from some additional guarantees for atomic operations. The size of the local workgroup is set by a layout annotation in GLSL (unspecified dimensions default to one). However, we do not need any of this right now; the information is here purely for the sake of completeness.
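
For illustration, this is how the workgroup size and the two indices appear in glsl; the 8x8 size is an arbitrary example choice, unrelated to the shader we develop below:

#version 450

// An 8x8x1 cuboid of invocations per workgroup (example choice).
layout( local_size_x = 8, local_size_y = 8 ) in;

void main() {
    // Index within the whole dispatch grid.
    uvec3 global_id = gl_GlobalInvocationID;
    // Index within this single workgroup.
    uvec3 local_id = gl_LocalInvocationID;
}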

The YUV420p format

This is a planar image format, where planar refers to the fact that the memory representation groups the samples by component and not by pixel. It is based on a particular YCbCr color space, in our case BT.709, also known as Rec.709. In its encoding, the Y samples are followed by the U and then the V samples. To save additional space, it stores only aggregated U and V samples, one for each two-by-two block of pixels in the original image.
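
Concretely, for a width×height image (both assumed even) the three planes follow each other in memory like this:

    offset 0:                Y plane, width·height samples
    offset width·height:     U plane, (width/2)·(height/2) samples
    offset width·height·5/4: V plane, (width/2)·(height/2) samples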

Note: The transfer function differs between sRGB and Rec.709, the two color representations processed here. Consequently, we need to linearize any sRGB representation and then re-apply the correct transfer to stay truly accurate. Any source that tells you to simply apply a linear transformation to sRGB in order to receive a Rec.709 encoding is inaccurate. In a typical CPU-bound encoding process working only on integer data, this inaccuracy might be justified by the extreme saving in processor cycles and by avoiding floating-point representations. Neither issue is of concern in a GPU setting, so we might as well do it correctly.

For the following transformations the color values are assumed to be normalized into the range [0; 1]. Only the quantization step depends on integer representation and bit depth. The transfer function (luminance → voltage) for Rec.709 is given by:

    ⎧ 4.5*L  if L < 0.018
V = ⎨
    ⎩ 1.099*(L**0.45) − 0.099  if L ≥ 0.018

Compare this with sRGB, even though, as we will see later, this is likely irrelevant for us since we do not have to apply it manually:

    ⎧ 12.92*L  if L < 0.0031308
V = ⎨
    ⎩ 1.055*(L**(1/2.4)) − 0.055  if L ≥ 0.0031308
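
Should you ever need to apply it manually, the sRGB curve translates to glsl in the same branch-free style used for Rec.709 in the shader below (a sketch, not part of the final shader):

// sRGB transfer (linear -> encoded), per component: the linear segment
// below the threshold, the power segment above it.
float srgb_unlinear(float s) {
    return mix(12.92*s, 1.055*pow(s, 1.0/2.4) - 0.055, step(0.0031308, s));
}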

The multiplication matrix to derive YCbCr is specified as:

⎛Y' ⎞   ⎛ 0.2215  0.7154  0.0721⎞   ⎛R'⎞
⎜Cb'⎟ = ⎜-0.1145 -0.3855  0.5000⎟ · ⎜G'⎟
⎝Cr'⎠   ⎝ 0.5016 -0.4556 -0.0459⎠   ⎝B'⎠

Finally, quantization of values is performed to arrive at a digital signal. Here n is the target bit-depth:

DY'  = int[(219·Y'  + 16) ·2**(n-8)]
DCb' = int[(224·Cb' + 128)·2**(n-8)]
DCr' = int[(224·Cr' + 128)·2**(n-8)]
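
For n = 8 this produces the usual limited-range code values, Y' covering [16; 235] and Cb'/Cr' covering [16; 240]. For example:

    int[219·1.0 + 16]      = 235
    int[224·0.5 + 128]     = 240
    int[224·(−0.5) + 128]  = 16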

The shader

Each invocation of the shader will work on an 8-by-2 block of the input image. The result is then one complete vec4 of samples for each of the u and v components. We further assume that the height is a multiple of four, to make address computation a bit easier. This makes it much more efficient to store the result into the output image, since store and load operations only work on complete pixel values. The input to our shader will be an image of either a ..Unorm or a ..Srgb format. In either case, the value loaded in the shader will already be linear, so we can skip linearizing it ourselves. This makes the process straightforward, although there might be room for optimization:

#version 450

// Workgroup size (an assumed choice; the dispatch must then cover
// width/8 x height/2 invocations in total, one per 8x2 pixel block).
layout( local_size_x = 8, local_size_y = 8 ) in;

layout( binding = 0, rgba8 ) uniform readonly image2D rgb;
layout( binding = 1, rgba8 ) uniform writeonly image2D result;

// Note that rec.709 and sRGB have the same primaries.
// Coefficients in row-major order, transposed on use since glsl
// constructors are column-major.
const mat3 mat_rgb709_to_ycbcr = mat3(
     0.2215,  0.7154,  0.0721,
    -0.1145, -0.3855,  0.5000,
     0.5016, -0.4556, -0.0459
);

// The Rec.709 transfer function from above: the linear segment below
// the threshold, the power segment (exponent 0.45) above it.
float rgb709_unlinear(float s) {
    return mix(4.5*s, 1.099*pow(s, 0.45) - 0.099, step(0.018, s));
}

vec3 unlinearize_rgb709_from_rgb(vec3 color) {
    return vec3(
        rgb709_unlinear(color.r),
        rgb709_unlinear(color.g),
        rgb709_unlinear(color.b));
}

vec3 ycbcr_from_rgbp(vec3 color) {
    vec3 yuv = transpose(mat_rgb709_to_ycbcr)*color;
    // Quantize to 8-bit code values; divide by 255 so that the Unorm
    // store into the rgba8 result image reproduces them exactly.
    vec3 quantized = vec3(
        (219.0*yuv.x + 16.0)/255.0,
        (224.0*yuv.y + 128.0)/255.0,
        (224.0*yuv.z + 128.0)/255.0);
    return quantized;
}

vec3 linear_rgb_to_yuv(vec3 color) {
    return ycbcr_from_rgbp(unlinearize_rgb709_from_rgb(color));
}

void main() {
    // Four samples are packed into each rgba8 texel of the output,
    // so the output image is a quarter as wide as the input.
    uint result_w = imageSize(rgb).x/4;

    uvec2 self_id = gl_GlobalInvocationID.xy;
    // Top-left pixel of the 8x2 input block handled by this invocation.
    ivec2 coords = ivec2(self_id.x*8, self_id.y*2);

    vec3 yuv[16];

    int index_x, index_y;

    for(index_y = 0; index_y < 2; index_y += 1) {
    for(index_x = 0; index_x < 8; index_x += 1) {
        // Loaded values are already linear: Srgb formats are linearized
        // on load and Unorm input is assumed to hold linear data.
        vec4 orig_color = imageLoad(rgb, coords + ivec2(index_x, index_y));
        vec3 yuv_color = linear_rgb_to_yuv(orig_color.rgb);
        yuv[index_y*8 + index_x] = yuv_color;
    } }

    // Store back the y values, four per texel.
    for(index_y = 0; index_y < 2; index_y += 1) {
    for(index_x = 0; index_x < 2; index_x += 1) {
        int i = index_y*8 + index_x*4;
        // The +16 offset was already applied during quantization.
        vec4 yyyy = vec4(yuv[i].x, yuv[i+1].x, yuv[i+2].x, yuv[i+3].x);
        imageStore(result, ivec2(2*self_id.x + index_x, 2*self_id.y + index_y), yyyy);
    } }

    // Top-left origins of the u and v planes inside the output image.
    ivec2 base_u = ivec2(0, imageSize(rgb).y);
    ivec2 base_v = ivec2(0, imageSize(rgb).y + imageSize(rgb).y/4);

    // Average u and v over each two-by-two block of the input.
    float us[4];
    float vs[4];
    for(index_x = 0; index_x < 4; index_x += 1) {
        int i = index_x*2;
        vec4 uuuu = vec4(yuv[i].y, yuv[i+1].y, yuv[i+8].y, yuv[i+9].y);
        vec4 vvvv = vec4(yuv[i].z, yuv[i+1].z, yuv[i+8].z, yuv[i+9].z);
        us[index_x] = dot(uuuu, vec4(1.0))/4.0;
        vs[index_x] = dot(vvvv, vec4(1.0))/4.0;
    }

    // Group u and v output
    vec4 ucode = vec4(us[0], us[1], us[2], us[3]);
    vec4 vcode = vec4(vs[0], vs[1], vs[2], vs[3]);

    // Linear index of this invocation's uv texel, remapped to the
    // coordinates of the quarter-width output image.
    uint uv_sample_count = self_id.x + self_id.y*(imageSize(rgb).x/8);
    ivec2 relative = ivec2(uv_sample_count%result_w, uv_sample_count/result_w);

    imageStore(result, base_u + relative, ucode);
    imageStore(result, base_v + relative, vcode);
}

(Free from third-party copyright, triple licensed under Unlicense, CC-0, and WTFPL, whatever suits your needs between corporate compliance, open source zealotry and software hacking. If I did not update the links to actual copies of the licenses, it was my fullest intent to do so and fill them with the missing information you could gather from the meta data on this page as well.)
