Preserved snapshot-restore memory with systemd

When a program dies unexpectedly, most of its state is discarded by the kernel and lost. This is problematic for a number of reasons and bad for robustness. The web-scale solution to this problem is twofold: put internal state into a database, and external state into the kernel. Specifically, the init program (a program's reaper) can preserve open sockets and pass them on to the restarted unit or a fallback.

More recently, the insight emerged that treating sockets differently from other file descriptors is not helpful. systemd implements this idea by offering an interface to register arbitrary file descriptors for persistence dynamically at runtime; they are then passed back on the next spawn of the service. This of course includes the ability to preserve shared memory files and thus some amount of memory. Nice.

Interacting with systemd

The interface is quite simple. systemd passes the initial state through a handful of environment variables; the program passes information back by sending messages over a provided socket (an AF_UNIX datagram socket) whose control data can include file descriptors.

  • $LISTEN_PID denotes the PID for which the file descriptors are configured; a process should ignore them if its own PID differs. Since they are close-on-exec, they are generally treated as exclusive to the process and thus modeled in Rust as an OwnedFd or RawFd.
  • $LISTEN_FDS holds the count of file descriptors passed; the first always refers to file descriptor 3, with the rest following consecutively.
  • $LISTEN_FDNAMES holds a colon-separated list of names or descriptions that have previously been set up. Decoding all three is sketched below.
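
A minimal decoding sketch, under the conventions above (the helper name is ours, not from any crate):

use std::os::fd::RawFd;

/// The first file descriptor passed by systemd is always 3.
const SD_LISTEN_FDS_START: RawFd = 3;

fn listen_fds() -> Option<Vec<(String, RawFd)>> {
    // Only consume the descriptors if they were destined for this process.
    let pid: u32 = std::env::var("LISTEN_PID").ok()?.parse().ok()?;
    if pid != std::process::id() {
        return None;
    }

    let count: usize = std::env::var("LISTEN_FDS").ok()?.parse().ok()?;
    let names = std::env::var("LISTEN_FDNAMES").unwrap_or_default();

    // Names are colon-separated, in the same order as the descriptors.
    Some(
        names
            .split(':')
            .map(String::from)
            .zip((0..count).map(|i| SD_LISTEN_FDS_START + i as RawFd))
            .collect(),
    )
}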

In the reverse direction, we send messages. Their contents are documented in man sd_notify as a line-oriented protocol. The one we need is a combination of FDSTORE=1 and FDNAME=name. Do not include a nul terminator in the message.
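
For example, with a hypothetical store name of shm-state, the complete payload reads:

// Newline-separated assignments; `shm-state` is a name of our choosing.
let state = "FDSTORE=1\nFDNAME=shm-state";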

Source code details: The whole notify interaction to store a socket

use std::env;
use std::ffi::OsString;
use std::os::fd::{AsRawFd, OwnedFd, RawFd};
use std::os::unix::net::UnixDatagram;

pub struct NotifyFd {
    fd: OwnedFd,
    addr: Vec<libc::c_char>,
}

// https://github.com/systemd/systemd/blob/414ae39821f0c103b076fc5f7432f827e0e79765/src/libsystemd/sd-daemon/sd-daemon.c#L454-L598
impl NotifyFd {
    pub fn new() -> Result<Option<Self>, std::io::Error> {
        let Some(addr) = env::var_os("NOTIFY_SOCKET") else {
            return Ok(None);
        };

        Self::from_env(addr).map(Some)
    }

    pub fn from_env(name: OsString) -> Result<Self, std::io::Error> {
        let bytes = name.as_encoded_bytes();

        // The address is either an absolute filesystem path or, marked by a
        // leading `@`, an abstract socket name. The `@` stands for the
        // leading nul byte of the name in `sun_path`.
        let addr: Vec<libc::c_char> = match bytes.first() {
            Some(b'/') => bytes.iter().map(|&b| b as libc::c_char).collect(),
            Some(b'@') => core::iter::once(0)
                .chain(bytes[1..].iter().map(|&b| b as libc::c_char))
                .collect(),
            _ => return Err(std::io::ErrorKind::Unsupported.into()),
        };

        // The socket stays unconnected; the destination address is instead
        // supplied with each `sendmsg` call, mirroring the C implementation.
        let dgram_socket = UnixDatagram::unbound()?;

        Ok(NotifyFd {
            fd: dgram_socket.into(),
            addr,
        })
    }

    // Consume the notify fd to send an FD notification.
    //
    // This mirrors what the C function does:
    // <https://github.com/systemd/systemd/blob/414ae39821f0c103b076fc5f7432f827e0e79765/src/libsystemd/sd-daemon/sd-daemon.c#L454C12-L454C40>
    //
    // It's somewhat puzzling that a fresh socket is opened for every single
    // message, but here we are. The code sends the credentials and file
    // descriptors as part of the *control* data, not the message data (that
    // is, of course, how you pass file descriptors), but it only sends the
    // control data once, even for streams. Thus we will only attempt at most
    // one message with file descriptors, and this method must consume the
    // NotifyFd.
    pub fn notify_with_fds(
        self,
        state: &str,
        fds: &[RawFd]
    ) -> Result<(), std::io::Error> {
        let mut hdr: libc::msghdr = unsafe { core::mem::zeroed() };
        let mut iov: libc::iovec = unsafe { core::mem::zeroed() };
        let mut addr: libc::sockaddr_un = unsafe { core::mem::zeroed() };

        iov.iov_base = state.as_ptr() as *mut libc::c_void;
        iov.iov_len = state.len();

        addr.sun_family = libc::AF_UNIX as libc::sa_family_t;
        let addr_len = addr.sun_path.len().min(self.addr.len());
        addr.sun_path[..addr_len].copy_from_slice(&self.addr[..addr_len]);

        hdr.msg_iov = &mut iov;
        hdr.msg_iovlen = 1;
        // Use the exact address length: abstract socket names are not
        // nul-terminated, so trailing zeroes would change the name.
        hdr.msg_namelen =
            (core::mem::size_of::<libc::sa_family_t>() + addr_len) as libc::c_uint;
        hdr.msg_name = &mut addr as *mut _ as *mut libc::c_void;

        // No ucred (credential) passing yet, so the control data only needs
        // room for the file descriptors.
        let len = u32::try_from(core::mem::size_of_val(fds))
            .expect("user error");
        let len = if len > 0 {
            (unsafe { libc::CMSG_SPACE(len) } as usize)
        } else { 0 };

        // The control buffer must be aligned like `cmsghdr`; allocate in
        // 64-bit units, rounding up to cover `len` bytes.
        let mut buf = vec![0u64; (len + 7) / 8];

        hdr.msg_controllen = len;
        hdr.msg_control = buf.as_mut_ptr() as *mut libc::c_void;

        if len > 0 {
            let cmsg = unsafe { libc::CMSG_FIRSTHDR(&hdr) };
            let cmsg = unsafe { &mut *cmsg };
            let msg_len = core::mem::size_of_val(fds);

            cmsg.cmsg_level = libc::SOL_SOCKET;
            cmsg.cmsg_type = libc::SCM_RIGHTS;
            cmsg.cmsg_len = unsafe { libc::CMSG_LEN(msg_len as u32) } as usize;

            assert!(cmsg.cmsg_len >= msg_len);
            let data = unsafe { libc::CMSG_DATA(cmsg) };

            // Safety: Pointer `data` is part of the buffer, by libc::CMSG_DATA.
            // Then fds is a pointer to an integer slice, always initialized.
            // Then the message length is the number of bytes in the slice.
            unsafe {
                core::ptr::copy_nonoverlapping(
                    fds.as_ptr() as *const _ as *const u8,
                    data,
                    msg_len,
                );
            }
        }

        let sent = unsafe {
            libc::sendmsg(self.fd.as_raw_fd(), &hdr, libc::MSG_NOSIGNAL)
        };

        if -1 == sent {
            return Err(std::io::Error::last_os_error());
        }

        if sent as usize != state.len() {
            return Err(std::io::ErrorKind::InvalidData.into());
        }

        Ok(())
    }
}

A number of optional code paths are not implemented above (for instance, if our effective UID differs from our real UID, we need to pass credentials to the remote side of the socket), but this suffices.
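
As a usage sketch, assume a service keeping its state in an anonymous in-memory file (the helper and the name shm-state are ours):

use std::os::fd::{AsRawFd, FromRawFd, OwnedFd};

fn store_shared_memory() -> std::io::Result<()> {
    // Create an anonymous in-memory file; the label is only for debugging.
    let raw = unsafe { libc::memfd_create(b"state\0".as_ptr().cast(), 0) };
    if raw == -1 {
        return Err(std::io::Error::last_os_error());
    }
    let memfd = unsafe { OwnedFd::from_raw_fd(raw) };

    // Ask systemd to keep the descriptor in its fd store; it comes back via
    // $LISTEN_FDS / $LISTEN_FDNAMES on the next spawn of the service.
    if let Some(notify) = NotifyFd::new()? {
        notify.notify_with_fds("FDSTORE=1\nFDNAME=shm-state", &[memfd.as_raw_fd()])?;
    }

    Ok(())
}

Note that the unit file needs FileDescriptorStoreMax= set to a non-zero value for systemd to accept the stored descriptor.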

Snapshot and Restore

The state in our in-memory file does, of course, not survive actual restarts of the machine. Nor does its mere existence guarantee consistent file contents after an abnormal termination: the semantics of writes into the file are not covered well by the memory model, the compiler might have reordered stores, and we could observe partial writes. We need another layer.

If we solve this problem, we might as well attempt to solve more of it. The shared memory file is, after all, an attempt at replacing the database component with something simpler. So let's invent an ad-hoc protocol that enables a separate program, with independent fault behavior, to take read-only copies of the state into a persistent file. The file system and kernel are not much help here. Doing atomic freezes of another program's memory ranges is complicated to do securely without hardware support (there was an attempt to solve this via FPGAs at the Operating Systems Chair of TUM under Prof. Dr. Uwe Baumgarten).

One solution is to not provide absolute guarantees and instead opt for an equivalent of optimistic locking. (A fully lock-free interactive version could coordinate memory regions between the host process and the user process, but that is left for another day.) We set up a self-describing header as well as a ring buffer of descriptors, which are essentially slot-map keys. Each descriptor consists of a monotonically incremented epoch and a reference to a slice of data. We can simplify by conflating the pointer portion and the epoch: since the pointer is only an offset and not a true address (which changes between runs), it can index into a ring of its own, where offsets to the same data alias at each full wrap-around.
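
As a rough layout sketch (field names are ours, not the actual shm-snapshot format):

use core::sync::atomic::AtomicU64;

// The file begins with a self-describing header, followed by the descriptor
// ring, followed by the data region the descriptors index into.
#[repr(C)]
struct Header {
    magic: u64,        // identifies the format and its version
    ring_entries: u64, // number of descriptors in the ring
    data_offset: u64,  // start of the data region within the file
}

#[repr(C)]
struct Descriptor {
    // Epoch and offset conflated into one monotonically increasing value:
    // modulo the data region length it recovers the offset, and each full
    // wrap-around of the ring bumps the implied epoch.
    state: AtomicU64,
    // Byte length of the referenced slice.
    len: u64,
}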

Crucially, we can set the validity of each descriptor with atomic memory instructions that also assert ordering of effects relative to writes of the data referred to by that descriptor. Enclosing all changes of the data portion within properly sequenced descriptor state changes gives the host a simple criterion for immutability: if a descriptor is observed twice in the same state, then the data it references has not been mutated between those observations either.
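
The host side of that criterion might look like this (a sketch with names of our choosing, not the crate's API):

use core::sync::atomic::{fence, AtomicU64, Ordering};

// Observe a descriptor, copy the data it references, and observe it again.
// Equal observations mean the copy saw an unmutated slice.
fn observe_stable(desc: &AtomicU64, copy_data: impl FnOnce()) -> Option<u64> {
    let before = desc.load(Ordering::Acquire);
    copy_data();
    // The acquire fence keeps the data reads from being reordered past the
    // second observation of the descriptor.
    fence(Ordering::Acquire);
    let after = desc.load(Ordering::Relaxed);
    (before == after).then_some(before)
}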

Our snapshot algorithm is thus sketched as follows:

bck = TemporaryFile.in(target_dir)

pre = gather_valid_descriptors(mmap)
copy_file_range(mmap, bck)
post = gather_valid_descriptors(mmap)

not_mutated = set(pre) & set(post)
retain_only_descriptors_in_set(bck, not_mutated)

# Atomic file operation within the same file system.
TemporaryFile.persist(bck, target)
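
The bulk copy itself can lean on the kernel. A sketch of that step (the function name is ours; short-copy and retry handling simplified):

use std::fs::File;
use std::os::fd::AsRawFd;

fn copy_all(src: &File, dst: &File, len: usize) -> std::io::Result<()> {
    let mut copied = 0;
    while copied < len {
        // copy_file_range moves the bytes inside the kernel, avoiding a
        // round trip through a userspace buffer; null offset pointers use
        // and advance the file positions of both descriptors.
        let n = unsafe {
            libc::copy_file_range(
                src.as_raw_fd(), std::ptr::null_mut(),
                dst.as_raw_fd(), std::ptr::null_mut(),
                len - copied, 0,
            )
        };
        match n {
            -1 => return Err(std::io::Error::last_os_error()),
            0 => break, // source ended early
            n => copied += n as usize,
        }
    }
    Ok(())
}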

Repository

Where to find it? The code is published in two crates:

  • shm-fd, which implements the host portion and the setup of the socket.
  • shm-snapshot, which interacts with the shared memory file via the protocol sketched above and enables atomic snapshots by the host.

The repository https://github.com/HeroicKatora/shmfd contains all the code, including further descriptions of how to set up a systemd environment that configures the programs as services and runs them. The proof-of-concept is an (artificially slowed down) prime sieve keeping its full table in the shared memory file.
