Futex and io_uring, a perfect match?

Let's explore the prospects of combining the futex system call with io_uring, in light of recently introduced Linux system calls which offer unexpected depth.

Futex, a recap

The word futex is an abbreviation for fast userspace mutex. This already implies a lot of semantics very concisely, but it is a little reductive, and we'll see a few places where the abstraction leaks if we look in detail. Explained in terms of the stabilized mechanism, it refers to a memory location available to user space, historically an aligned 32-bit value, which the kernel knows how to atomically check-and-wake: certain futex-related system calls provide strong atomicity guarantees with respect to certain operations on these memory locations. User space is, in theory, free to choose any semantics for the value and any (atomic) instructions to modify it. Very often in practice, though, this is used to implement a mutex, and in that special case a number of system calls also interact with the scheduler's priority system to mitigate priority-inversion problems.
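To make this concrete, here is a minimal sketch of the two classic operations on a 32-bit location, going through the raw futex(2) system call. The constant values come from <linux/futex.h>; the wrapper names are purely illustrative and not part of any library.

use std::sync::atomic::AtomicU32;

// Operation numbers from <linux/futex.h>.
const FUTEX_WAIT: libc::c_int = 0;
const FUTEX_WAKE: libc::c_int = 1;

/// Put the calling thread to sleep as long as `*futex == expected`; fails
/// with EAGAIN (via errno) if the value already differs at the time of the call.
fn futex_wait(futex: &AtomicU32, expected: u32) -> libc::c_long {
    unsafe {
        libc::syscall(
            libc::SYS_futex,
            futex as *const AtomicU32,
            FUTEX_WAIT,
            expected,
            std::ptr::null::<libc::timespec>(), // no timeout
        )
    }
}

/// Wake up to `count` threads suspended on `futex`; returns how many woke.
fn futex_wake(futex: &AtomicU32, count: u32) -> libc::c_long {
    unsafe {
        libc::syscall(
            libc::SYS_futex,
            futex as *const AtomicU32,
            FUTEX_WAKE,
            count,
        )
    }
}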

The main restriction is that the kernel itself will usually not write to the memory location, and none of the current calls perform conditional writes. The two exceptions are WAKE_OP and a flag bit that may be set, but never cleared, for priority-inversion owners. Do not mistake the futex for a mechanism that achieves any form of transactional memory on its own.

The main benefit compared to a pure-userspace loop should be protection against lost wakes, yet this too must be interpreted somewhat carefully. That is, if a thread performs a wait with the assertion of a specific value at the address, and another uses an atomic read-modify-write (RMW) or compare-and-swap (CAS) instruction to change the value before then waking futexes, then the waiter is guaranteed to either observe the new value and not be put to sleep, or to be woken.

If both threads live, you get one of two traces. The first is one where the waiter is suspended and then awoken:

T0:         T1:

.           *f <> 1
wait(f, 1)  *f <> 0
 (S)        wake(f)
= 0         = 1

And the alternative trace where the waiter is never suspended, so there is no one to awaken:

T0:         T1:

.           *f <> 1
wait(f, 1)  *f <> 0
= EAGAIN    wake(f)
.           = 0
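Mapped onto code, the two columns of these traces might look like the following sketch, reusing the illustrative wrappers from above (the exchange stands in for whatever RMW or CAS the protocol uses):

use std::sync::atomic::{AtomicU32, Ordering};

/// T0's column: sleep while the futex still reads 1. An EAGAIN result (the
/// second trace) just means the value changed before we were suspended.
fn wait_until_released(futex: &AtomicU32) {
    while futex.load(Ordering::Acquire) == 1 {
        futex_wait(futex, 1); // returns on a wake, on EAGAIN, or spuriously
    }
}

/// T1's column: the RMW that releases the value, followed by the wake. The
/// wake returns 1 in the first trace and 0 in the second.
fn release_and_wake(futex: &AtomicU32) {
    futex.swap(0, Ordering::Release);
    futex_wake(futex, 1);
}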

Troubles with futex waking

The first problem with this pair is that it is not hazard-free. What happens if the 'owner' of a futex, i.e. the one in charge of modifying its value, suddenly dies for any reason? Firstly, we might never reach the atomic CAS instruction that relinquishes ownership. More problematic still, in the former trace a thread might die between its unlocking and wake operations, leaving T0 suspended while other actors checking afterwards are led to believe, on re-checking the state, that a wake was already sent.

T0:         T1:

.           *f <> 1
wait(f, 1)  .
 (S)        *f <> 0
 (S)        (dies)

This is bad. If we care about this problem, we can fix it with WAKE_OP, provided the mutex can be released by an unconditional atomic write; there are only very limited ways to introspect the original value. The WAKE_OP command consists of three steps:

  • Perform a certain read-modify-write operation on a secondary futex address with a provided value.
  • Perform a wake for waiters on a primary futex address.
  • Conditionally, based on a simple condition evaluated against the overwritten previous value, also perform a wake for waiters on the secondary futex address.

The owner of a mutex is usually assumed to have unique ownership of the current lock, so it can choose to perform an unconditional write in the first step and provide the same futex address for both operands to also wake unconditionally. Alternatively, the condition check can be massaged to be unconditionally true or false by using static comparisons such as oldval < 0.
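A hedged sketch of that pattern, again via the raw system call: the constants and the FUTEX_OP encoding are taken from <linux/futex.h> and futex(2), while the helper names are mine. The write and the wake happen inside one system call, which is exactly what closes the gap from the trace above.

use std::sync::atomic::AtomicU32;

// Values from <linux/futex.h>.
const FUTEX_WAKE_OP: libc::c_int = 5;
const FUTEX_OP_SET: u32 = 0;    // step 1: *uaddr2 = oparg
const FUTEX_OP_CMP_EQ: u32 = 0; // step 3 fires if oldval == cmparg

/// The FUTEX_OP() encoding described in futex(2).
fn futex_op(op: u32, oparg: u32, cmp: u32, cmparg: u32) -> u32 {
    ((op & 0xf) << 28) | ((cmp & 0xf) << 24) | ((oparg & 0xfff) << 12) | (cmparg & 0xfff)
}

/// Unconditionally store 0 into `*futex` and wake one waiter on it, with no
/// window in between in which the unlocking thread could die.
fn unlock_and_wake_op(futex: &AtomicU32) -> libc::c_long {
    unsafe {
        libc::syscall(
            libc::SYS_futex,
            futex as *const AtomicU32, // primary address: woken unconditionally
            FUTEX_WAKE_OP,
            1,                         // wake at most one waiter on the primary
            0usize,                    // val2: waiters to wake conditionally (none)
            futex as *const AtomicU32, // secondary address: target of the write
            futex_op(FUTEX_OP_SET, 0, FUTEX_OP_CMP_EQ, 0),
        )
    }
}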

Troubles with futex waiting

Historically, waiting is a synchronous operation interacting with a single futex address. You put the thread itself to sleep with some specified expected value so that it can be resumed by a wake. It's not necessary that the waker has actually modified the value; once the thread has been suspended, the value no longer matters to the waiter. Side note: the waiter and waker sides can agree on selective wakes by tagging the waiter with a bitset and masking it on the wake side, though not all waking operations will check the mask. Specifically, WAKE_OP as above does not, which turns this into a quite restrictive use case.
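For completeness, the tagged variants look like this sketch. The operation numbers are the FUTEX_WAIT_BITSET and FUTEX_WAKE_BITSET values from <linux/futex.h>, the mask must be nonzero, and an all-ones mask (FUTEX_BITSET_MATCH_ANY) recovers the untagged behaviour; the wrapper names are again illustrative.

use std::sync::atomic::AtomicU32;

// Values from <linux/futex.h>.
const FUTEX_WAIT_BITSET: libc::c_int = 9;
const FUTEX_WAKE_BITSET: libc::c_int = 10;

/// Wait on `futex` while it holds `expected`, tagging the waiter with `mask`.
fn futex_wait_tagged(futex: &AtomicU32, expected: u32, mask: u32) -> libc::c_long {
    unsafe {
        libc::syscall(
            libc::SYS_futex,
            futex as *const AtomicU32,
            FUTEX_WAIT_BITSET,
            expected,
            std::ptr::null::<libc::timespec>(), // (absolute) timeout, none here
            std::ptr::null::<u32>(),            // second address, unused
            mask,                               // the waiter's tag bits
        )
    }
}

/// Wake up to `count` waiters whose tag intersects `mask`.
fn futex_wake_tagged(futex: &AtomicU32, count: u32, mask: u32) -> libc::c_long {
    unsafe {
        libc::syscall(
            libc::SYS_futex,
            futex as *const AtomicU32,
            FUTEX_WAKE_BITSET,
            count,
            std::ptr::null::<libc::timespec>(),
            std::ptr::null::<u32>(),
            mask,
        )
    }
}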

The problem here is that, of course, everyone now needs to be aware of the thread being suspended on that futex if we need to resume it. It also implies that only a single futex may be awaited per thread at a time. This doesn't scale for interprocess communication. If a server process intends to serve multiple clients, then the obvious analogy to sockets applies: we need something like poll and the ability to avoid a thread-per-client requirement.

The news with io_uring

The io_uring subsystem can address several exciting parts of these problems. Firstly, waiting is no longer synchronous, since a specific operation is suspended instead of a thread. Secondly, a new vectored form of futex waiting allows one wait to be enqueued on multiple locations at once, though limited to 128, completing at the earliest wake. (Curiously, this means we could even atomically assert the values of locations on which we never expect to be woken.) And thirdly, we can use a ring FD as a form of exit handler to enqueue wakes to be triggered at abnormal process termination.

Since the first point, asynchronous waiting, is straightforward, let's jump straight into the new system call.

The futex_waitv call

This new call is, judging from its signature, similar to vectored reads and writes. Instead of handing over the parameters in individual registers, one hands over a pointer to a whole vector of individual operations and a count. Fully:

extern "system" {
    unsafe fn _sys_futex_waitv(
        _sys_futex_waitv: uapi::c::c_long,
        waiters: *mut FutexWaitVec,
        waiters_nr: uapi::c::c_int,
        flags: uapi::c::c_int,
        timeout: Option<&mut uapi::c::timespec>,
        clock_source: uapi::c::clockid_t,
    ) -> uapi::c::c_long;
}

#[repr(C)]
pub struct FutexWaitVec<'lt> {
    /// The value the futex word is expected to still hold.
    pub expected_value: u64,
    /// The futex's address.
    pub address: u64,
    /// The address must reference memory and it must be alive.
    _addr_owner: PhantomData<&'lt atomic::AtomicU32>,
    /// The kind of futex wait.
    pub flags: FutexFlags,
    /// Reserved by the kernel, must be zero.
    __reserved: u32,
}

#[repr(transparent)]
pub struct FutexFlags(pub u32);

The respective io_uring opcode is constructed accordingly, except that the timeout mechanism is of course provided by the existing LINK_TIMEOUT and thus lives in a separate operation. A note on the flags: these allow for different memory sizes. As noted in the introduction, the historical system calls presupposed a fixed 32-bit value type, whereas here the size is indicated by the user space caller through a bit-masked value. Use 0x2 for the conventional 32-bit semantics you're used to.
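As a sketch of how the pieces fit together, here is one way to drive the raw call with the declarations above in scope. The specifics are assumptions on my part: 449 is the x86_64 syscall number for futex_waitv, 0x2 is the 32-bit size flag just mentioned, and the wait_any helper is purely illustrative.

use std::marker::PhantomData;
use std::sync::atomic::AtomicU32;

// Assumed values: __NR_futex_waitv on x86_64 and the 32-bit size flag.
const SYS_FUTEX_WAITV: libc::c_long = 449;
const FUTEX2_SIZE_U32: u32 = 0x2;

/// Illustrative helper: block until one of the futexes is woken or no longer
/// holds `expected`; on success the return value is the index of that entry.
fn wait_any(futexes: &[&AtomicU32], expected: u32) -> libc::c_long {
    let mut waiters: Vec<FutexWaitVec<'_>> = futexes
        .iter()
        .map(|f| FutexWaitVec {
            expected_value: expected as u64,
            address: (*f as *const AtomicU32) as u64,
            _addr_owner: PhantomData,
            flags: FutexFlags(FUTEX2_SIZE_U32),
            __reserved: 0,
        })
        .collect();

    unsafe {
        libc::syscall(
            SYS_FUTEX_WAITV,
            waiters.as_mut_ptr(),
            waiters.len() as libc::c_int,
            0,                                  // flags: none needed here
            std::ptr::null::<libc::timespec>(), // no timeout
            libc::CLOCK_MONOTONIC,
        )
    }
}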

Using a ring as an exit handler

  1. Set up a ring in a supervisor process and share it with the child.
  2. In the child, create a pipe (CLOEXEC).
  3. Submit two ops to the ring:
     3.1. The first op is a 1-byte read from the pipe's read-end, flagged with IO_HARDLINK.
     3.2. The second op is some externally visible operation such as FUTEX_WAKE.
  4. Have the child die. The kernel will close the write-end of the pipe. This close makes the read resolve, and then, by the link, the action is triggered.

A demonstration of the scheme is here: shm-pbx, user-space ring communication. (You'll need a Linux 6.8 kernel to run it for yourself.)
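As a simplified, single-process sketch of steps 3 and 4 alone, using the io-uring and libc crates: the Nop is a placeholder I substitute for the futex wake (which needs a 6.7+ kernel and a recent crate version), and closing the write-end by hand plays the role the kernel plays when the process dies.

use io_uring::{opcode, squeue, types, IoUring};

fn main() -> std::io::Result<()> {
    let mut ring = IoUring::new(8)?;

    // Step 2: the pipe that stands in for the child's CLOEXEC pipe.
    let mut fds = [0i32; 2];
    assert_eq!(unsafe { libc::pipe2(fds.as_mut_ptr(), libc::O_CLOEXEC) }, 0);
    let (read_end, write_end) = (fds[0], fds[1]);

    // Step 3.1: a 1-byte read from the read-end, hard-linked to the next op.
    let mut byte = 0u8;
    let read_op = opcode::Read::new(types::Fd(read_end), &mut byte, 1)
        .build()
        .flags(squeue::Flags::IO_HARDLINK)
        .user_data(1);
    // Step 3.2: a stand-in for the externally visible action; a real exit
    // handler would queue the futex wake opcode here instead.
    let action = opcode::Nop::new().build().user_data(2);

    unsafe {
        let mut sq = ring.submission();
        sq.push(&read_op).expect("submission queue full");
        sq.push(&action).expect("submission queue full");
    }
    ring.submit()?;

    // Step 4, simulated: this close is what the kernel does on process death.
    // The read resolves with EOF and the hard link fires the second op.
    unsafe { libc::close(write_end) };

    ring.submit_and_wait(2)?;
    for cqe in ring.completion() {
        println!("user_data {} -> result {}", cqe.user_data(), cqe.result());
    }
    unsafe { libc::close(read_end) };
    Ok(())
}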

Various variations of this scheme are possible. For instance, instead of sharing the file descriptor with a child, it might be shared through another mechanism: sending it to a peer, for instance, or to a supervisor (the systemd file descriptor store). The operation in 3.2 need not be futex related, and it need not be a single operation. Can you demonstrate a use with writes, maybe even with pre-registered buffers and descriptors? The choice of waiting on the closed pipe is not necessarily optimal. Maybe the process should open its own pidfd instead?
