Futex and io_uring, a perfect match?
Let's explore the prospects of combining the futex system call with io_uring, in light of recently introduced Linux system calls which offer unexpected depth.
Futex, a recap
The word futex is an abbreviation for fast userspace mutex. That name implies a lot of semantics very concisely, but it is also a little reductive, and we'll see a few of the places where it leaks if we look at the details. Explained in terms of the stabilized mechanism, a futex is a memory location available to user space, historically an aligned 32-bit value, which the kernel knows how to atomically check-and-wake: certain futex-related system calls provide strong atomicity guarantees with respect to certain operations on these memory locations. User space is, in theory, free to choose any semantics for the value and any (atomic) instructions to modify it. Very often in practice, though, this is used to implement a mutex, and in that special case a number of system calls also interact with the scheduler's priority system to mitigate priority inversion problems.
The main restriction is that the kernel itself will usually not write to the memory location, and none of the current calls perform conditional writes. The two exceptions are WAKE_OP and a flag bit that may be set, but not cleared, for owners of priority-inheritance futexes. Do not mistake the futex interface for a mechanism that achieves any form of transactional memory on its own.
The main benefit compared to a pure user-space loop should be protection against lost wakes, yet this too must be interpreted somewhat carefully. That is, if one thread performs a wait with the assertion of a specific value at the address, and another uses an atomic read-modify-write (RMW) or compare-and-swap (CAS) instruction to change the value before then waking the futex, then the waiter is guaranteed to either observe the new value and not be put to sleep, or to be woken.
If both parties stay alive, you get one of two traces. One where the waiter is suspended and then woken:
T0:          T1:
.            *f <> 1
wait(f, 1)   *f <> 0
(S)          wake(f)
= 0          = 1
And the alternative trace where the waiter is never suspended because it already observes the new value:
T0:          T1:
.            *f <> 1
wait(f, 1)   *f <> 0
= EAGAIN     wake(f)
.            = 0
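To make the traces concrete, here is a minimal sketch of both sides, my own illustration rather than code from the post, issuing the classic futex(2) call through the libc crate; the helper futex and the two wrapper functions are names invented for the example.

    use std::sync::atomic::{AtomicU32, Ordering};

    // Thin wrapper over the futex(2) system call; only the arguments needed
    // for FUTEX_WAIT / FUTEX_WAKE are exposed here.
    fn futex(word: &AtomicU32, op: libc::c_int, val: u32) -> libc::c_long {
        unsafe {
            libc::syscall(
                libc::SYS_futex,
                word as *const AtomicU32,
                op,
                val,
                std::ptr::null::<libc::timespec>(), // no timeout
                std::ptr::null::<u32>(),            // unused for these ops
                0u32,                               // unused for these ops
            )
        }
    }

    // T0: sleep as long as the word still holds 1. A wake or an EAGAIN (the
    // value already changed) both lead back to the re-check, matching the
    // two traces above.
    fn wait_until_released(word: &AtomicU32) {
        while word.load(Ordering::Acquire) == 1 {
            futex(word, libc::FUTEX_WAIT, 1);
        }
    }

    // T1: the "*f <> 0" step followed by the wake from the traces.
    fn release_and_wake(word: &AtomicU32) {
        word.swap(0, Ordering::Release);
        futex(word, libc::FUTEX_WAKE, 1);
    }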
Troubles with futex waking
The first problem with this pair is that it is not hazard-free. What happens if the 'owner' of a futex, i.e. the one in charge of modifying its value, suddenly dies for any reason? Firstly, we might never reach the atomic CAS instruction relinquishing ownership. More problematic still, a thread might die in the former trace between its unlocking and wake operations, leaving T0 suspended while other actors checking afterwards are led to believe (on re-checking the state) that a wake was already sent.
T0:          T1:
.            *f <> 1
wait(f, 1)   .
(S)          *f <> 0
(S)          (dies)
This is bad. If we care about this problem, we can fix it by using a WAKE_OP, provided the mutex can be released by an unconditional atomic write; the operation offers only very limited ways to inspect the original value. The WAKE_OP command consists of three steps:
- Perform a read-modify-write operation, with a provided value, on a secondary futex address.
- Perform a wake for waiters on a primary futex address.
- Conditionally, based on a simple condition evaluated against the overwritten previous value, also perform a wake for waiters on the secondary futex address.
The owner of a mutex is usually assumed to hold the current lock exclusively, so it can choose to perform an unconditional write in the first step and provide the same futex address for both operands to also wake unconditionally. Alternatively, the condition check can be massaged to be unconditionally true or false by using a static comparison such as oldval < 0.
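As a sketch of the second variant, an unlock-and-wake via WAKE_OP might look as follows. This is my own illustration using the raw syscall through the libc crate; the opcode value and the FUTEX_OP encoding are copied from the futex(2) man page, and unlock_and_wake is an invented name.

    use std::sync::atomic::AtomicU32;

    const FUTEX_WAKE_OP: libc::c_int = 5;
    // Secondary operation: *uaddr2 = oparg (an unconditional store).
    const FUTEX_OP_SET: u32 = 0;
    // Condition on the previous value of *uaddr2: oldval < cmparg.
    const FUTEX_OP_CMP_LT: u32 = 2;

    // The FUTEX_OP() encoding from <linux/futex.h>:
    // (op << 28) | (cmp << 24) | (oparg << 12) | cmparg
    fn futex_op(op: u32, oparg: u32, cmp: u32, cmparg: u32) -> libc::c_int {
        (((op & 0xf) << 28) | ((cmp & 0xf) << 24) | ((oparg & 0xfff) << 12) | (cmparg & 0xfff))
            as libc::c_int
    }

    // Store 0 into the futex word and wake one waiter on it, atomically with
    // respect to concurrent waiters. The condition `oldval < 0` never holds
    // for our lock values, so the second, conditional wake never fires.
    fn unlock_and_wake(futex: &AtomicU32) -> std::io::Result<()> {
        let uaddr = futex as *const AtomicU32;
        let ret = unsafe {
            libc::syscall(
                libc::SYS_futex,
                uaddr,          // primary: wake waiters here
                FUTEX_WAKE_OP,
                1,              // wake at most one waiter
                0usize,         // val2, passed in the timeout slot
                uaddr,          // secondary: receives the store
                futex_op(FUTEX_OP_SET, 0, FUTEX_OP_CMP_LT, 0),
            )
        };
        if ret < 0 {
            Err(std::io::Error::last_os_error())
        } else {
            Ok(())
        }
    }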
Troubles with futex waiting
Historically, waiting is a synchronous operation interacting with a single futex address. You put the thread itself to sleep with some specified expected value, so that it is resumed by a wake. It's not necessary that the waker has actually modified the value; once the thread was suspended, the value no longer matters to the waiter. Side note: the waiter and waker side can agree on selective wakes by tagging the waiter with a bitset and waking with a bitmask, though not all waking operations will check the mask. Specifically, WAKE_OP as above does not, which turns this into a quite restrictive use case.
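For reference, the bitset tagging mentioned in the side note might be used roughly like this; again my own sketch through the raw syscall, with the *_BITSET opcode values taken from <linux/futex.h> and the function names invented.

    use std::sync::atomic::AtomicU32;

    const FUTEX_WAIT_BITSET: libc::c_int = 9;
    const FUTEX_WAKE_BITSET: libc::c_int = 10;

    // The waiter tags itself with a non-zero bitset.
    fn wait_tagged(word: &AtomicU32, expected: u32, tag: u32) -> libc::c_long {
        unsafe {
            libc::syscall(
                libc::SYS_futex,
                word as *const AtomicU32,
                FUTEX_WAIT_BITSET,
                expected,
                std::ptr::null::<libc::timespec>(), // absolute timeout, none here
                std::ptr::null::<u32>(),            // unused
                tag,
            )
        }
    }

    // The waker supplies a mask; only waiters whose tag intersects it wake.
    fn wake_tagged(word: &AtomicU32, mask: u32) -> libc::c_long {
        unsafe {
            libc::syscall(
                libc::SYS_futex,
                word as *const AtomicU32,
                FUTEX_WAKE_BITSET,
                i32::MAX,                           // wake all matching waiters
                std::ptr::null::<libc::timespec>(), // unused
                std::ptr::null::<u32>(),            // unused
                mask,
            )
        }
    }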
The problem here is that, of course, everyone now needs to be aware of the thread being suspended on that futex if we want to resume it. It also implies that a single futex may be awaited per thread at a time. This doesn't scale for interprocess communication. If a server process intends to serve multiple clients then the obvious analogy to sockets applies: we need poll and the ability to avoid a thread-per-client requirement.
The news with io_uring
The io_uring subsystem can address many exciting parts of these problems. Firstly, waiting is no longer synchronous, since a specific operation is suspended instead of a thread. Secondly, a new vectored form of futex waiting allows one wait to be enqueued on multiple locations at once, though limited to 128, completing at the earliest wake. (Curiously, this means we could even atomically assert the value of locations which we aren't ever expecting to be woken on.) And thirdly, we can use a ring FD as a form of exit handler, to enqueue wakes that are triggered at abnormal process termination.
The asynchronous part is straightforward, so let's jump right into the new system call.
The futex_waitv call
This new call is, judging from its signature, similar to vectored reads and writes. Instead of handing over the parameters in individual registers, one hands over a pointer to a whole vector of individual operations together with a count. In full:
extern "system" {
unsafe fn _sys_futex_waitv(
_sys_futex_waitv: uapi::c::c_long,
waiters: *mut FutexWaitVec,
waiters_nr: uapi::c::c_int,
flags: uapi::c::c_int,
timeout: Option<&mut uapi::c::timespec>,
clock_source: uapi::c::clockid_t,
) -> uapi::c::c_long;
}
#[repr(C)]
pub struct FutexWaitVec<'lt> {
pub expected_value: u64,
/// The futex's address.
pub address: u64,
/// The address must reference memory and it must be alive.
_addr_owner: PhantomData<&'lt atomic::AtomicU32>,
/// The kind of futex wait.
pub flags: FutexFlags,
__reserved: u32,
}
#[repr(transparent)]
pub struct FutexFlags(pub u32);
The respective io_uring opcode is constructed accordingly, except that the timeout mechanism is of course provided by the existing LINK_TIMEOUT and thus lives in a separate operation. A note on the flags: these select among different memory sizes. As noted in the introduction, the historical system calls presupposed a fixed 32-bit value type, whereas here the size is indicated by the user-space caller through a bit-masked value. Use 0x2 for the conventional semantics you're used to.
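To tie the pieces together, a wait on two futex words at once might be set up as follows. This is my own sketch, not from the post: it assumes it lives in the same module as the definitions above (so the private fields are reachable), that 449 is the x86_64 syscall number of futex_waitv, and that 0x2 and 1 are the kernel values for the 32-bit size flag and CLOCK_MONOTONIC respectively.

    use core::marker::PhantomData;
    use std::sync::atomic::AtomicU32;

    const SYS_FUTEX_WAITV: uapi::c::c_long = 449; // x86_64 syscall number
    const FUTEX2_SIZE_U32: u32 = 0x2;             // the conventional 32-bit futex
    const CLOCK_MONOTONIC: uapi::c::clockid_t = 1;

    // Suspend until either of the two futex words changes away from `expect`.
    // On success the return value is the index of the entry that was woken.
    fn wait_on_either(a: &AtomicU32, b: &AtomicU32, expect: u32) -> uapi::c::c_long {
        let mut waiters = [
            FutexWaitVec {
                expected_value: expect as u64,
                address: a as *const AtomicU32 as u64,
                _addr_owner: PhantomData,
                flags: FutexFlags(FUTEX2_SIZE_U32),
                __reserved: 0,
            },
            FutexWaitVec {
                expected_value: expect as u64,
                address: b as *const AtomicU32 as u64,
                _addr_owner: PhantomData,
                flags: FutexFlags(FUTEX2_SIZE_U32),
                __reserved: 0,
            },
        ];

        unsafe {
            _sys_futex_waitv(
                SYS_FUTEX_WAITV,
                waiters.as_mut_ptr(),
                waiters.len() as uapi::c::c_int,
                0,    // flags of the call itself
                None, // no timeout; otherwise interpreted against clock_source
                CLOCK_MONOTONIC,
            )
        }
    }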
Using a ring as an exit handler
1. Set up a ring in a supervisor process, share it with the child.
2. In the child, create a pipe (CLOEXEC).
3. Submit two ops to the ring (sketched in code below):
   3.1 The first op being a 1-byte read from the pipe's read end, flagged as a hard link (IO_HARDLINK).
   3.2 The second op being some externally visible operation such as FUTEX_WAKE.
4. Have the child die. The kernel will close the write end of the pipe. This close makes the read resolve, and the link then triggers the follow-up action.
A demonstration of the scheme is here: shm-pbx, user-space ring communication. (You'll need a Linux 6.8 kernel to run it for yourself.)
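Roughly, steps 3 and 4 might look like the following with the tokio-rs io-uring crate. This is my own sketch rather than the shm-pbx code; the function and descriptor names are invented, and the hard-linked follow-up op is a plain write standing in for the futex wake from 3.2.

    use io_uring::{opcode, squeue, types, IoUring};
    use std::os::fd::RawFd;

    /// Arm the "exit handler": a hard-linked pair of ops that stays pending in
    /// the kernel until the child's write end of the pipe is closed.
    fn arm_exit_handler(
        ring: &mut IoUring,
        lifetime_pipe_rd: RawFd, // read end; the write end is held only by the child
        notify_fd: RawFd,        // stand-in target for the externally visible action
    ) -> std::io::Result<()> {
        // Scratch byte for the read; leaked so it outlives the pending op.
        let scratch: &'static mut [u8; 1] = Box::leak(Box::new([0u8; 1]));
        static MSG: [u8; 1] = [b'!'];

        // 3.1: a 1-byte read that only resolves (with 0, i.e. EOF) once the
        // kernel closes the child's write end at process exit. IO_HARDLINK
        // keeps the next op queued even though the short read would sever a
        // regular link.
        let read_op = opcode::Read::new(types::Fd(lifetime_pipe_rd), scratch.as_mut_ptr(), 1)
            .build()
            .flags(squeue::Flags::IO_HARDLINK)
            .user_data(1);

        // 3.2: the follow-up action; a write here, a futex wake works the same.
        let notify_op = opcode::Write::new(types::Fd(notify_fd), MSG.as_ptr(), 1)
            .build()
            .user_data(2);

        unsafe {
            let mut sq = ring.submission();
            sq.push(&read_op).expect("submission queue full");
            sq.push(&notify_op).expect("submission queue full");
        }
        ring.submit()?; // both ops now wait in the kernel for the child to die
        Ok(())
    }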
Various variations of this scheme are possible. For instance, instead of sharing the file descriptor with a child, it might be shared through another mechanism: sending it to a peer, for instance, or to a supervisor (the systemd file descriptor store). The operation in 3.2 need not be futex related, and it need not be a single operation. Can you demonstrate a use with writes, maybe even with pre-registered buffers and descriptors? The choice of waiting on the closed pipe is not necessarily optimal either. Maybe the process should open its own pidfd instead?