A barebones Linux binary for x86-64

Coming to Rust from a C background, many newcomers are badly surprised by the size of even a very basic Rust binary. Much of this is caused by linking to the standard library. There are already lots of articles on techniques for shrinking your binaries after you've built them, but I want to approach this from the opposite end. Starting from as little as possible, how much do we actually need for a standard binary to 'run'?

Disaster Tourism: Running with a libc

Let's entertain for a moment the idea of using a target with a 'platform' implementation. We quickly run into a lot of rubble. As it turns out, the setup actually does quite a lot. When we want to execute a main, which code actually transfers control? Which code will handle abnormal termination? While the latter was stabilized in the form of a panic handler attribute, running a main function is still complex. Let's just try to do nothing.

#![no_std]
use core::panic::PanicInfo;

#[panic_handler]
fn panic(_info: &PanicInfo) -> ! {
    loop {}
}

fn main() {}

error: requires `start` lang_item

error: could not compile `no_std_hi` (bin "no_std_hi") due to previous error

Huh? You see, the platform does not only come with an initialization, each such C platform also comes with its own mechanism for transferring control. The contract to tell Rust which function to match to said mechanism is a language item, #[start]. And since the signature and all ABI concerns depend on the platform outside of Rust's own control, none of this is remotely stable. We now have two paths: we can skip Rust's mechanism of transferring control, or we can use nightly. As it turns out, they lead to similar outcomes on Linux/libc at least.

For completeness' sake, here is the nightly #[start] variant.

#![no_std]
#![feature(start)]
use core::panic::PanicInfo;

#[panic_handler]
fn panic(_info: &PanicInfo) -> ! {
    loop {}
}

#[start]
fn main(_: isize, _: *const*const u8) -> isize { 0 }

#[no_mangle] extern "C" fn __libc_start_main() -> ! {
    main(0, core::ptr::null());
    loop {}
}

The other path stays on stable and skips Rust's control transfer with #![no_main]:

#![no_std]
#![no_main]
use core::panic::PanicInfo;

#[panic_handler]
fn panic(_info: &PanicInfo) -> ! {
    loop {}
}

#[no_mangle]
fn main() { }

error: linking with `cc` failed: exit status: 1
  |
  = note: LC_ALL="C" PATH="/r/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/bin:/r/.opam/5.0.0/bin:/usr/local/bin:/usr/bin:/bin:/usr/local/sbin:/opt/cuda/bin:/opt/cuda/nsight_compute:/opt/cuda/nsight_systems/bin:/usr/lib/emscripten:/usr/lib/jvm/default/bin:/usr/bin/site_perl:/usr/bin/vendor_perl:/usr/bin/core_perl:/usr/lib/rustup/bin:/r/.cargo/bin:/r/intel/bin:/r/.local/bin:/r/bin:/r/build/x10-2.5.4/x10.dist/bin:/r/tree/usr/bin:/r/tree/bin:/r/tree/arm-linux-musleabihf/bin:/r/.cargo/bin:/r/.local/bin:/r/.local/bin" VSLANG="1033" "cc" "-m64" "/tmp/rustcAEfotu/symbols.o" "/tmp/no_std_hi/target/release/deps/no_std_hi-7ac8a610109fff56.no_std_hi.59b3cadcf3826e40-cgu.0.rcgu.o" "-Wl,--as-needed" "-L" "/tmp/no_std_hi/target/release/deps" "-L" "/r/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib" "-Wl,-Bstatic" "/r/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib/librustc_std_workspace_core-0577018320f99037.rlib" "/r/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib/libcore-193cf992125ccd4c.rlib" "/r/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib/libcompiler_builtins-8e138eaf26ebb4a8.rlib" "-Wl,-Bdynamic" "-Wl,--eh-frame-hdr" "-Wl,-z,noexecstack" "-L" "/r/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib" "-o" "/tmp/no_std_hi/target/release/deps/no_std_hi-7ac8a610109fff56" "-Wl,--gc-sections" "-pie" "-Wl,-z,relro,-z,now" "-Wl,-O1" "-nodefaultlibs" "-fuse-ld=lld"
  = note: ld.lld: error: undefined symbol: __libc_start_main
          >>> referenced by /usr/lib/gcc/x86_64-pc-linux-gnu/13.2.1/../../../../lib/Scrt1.o:(_start)
          collect2: error: ld returned 1 exit status

This issue arises as part of the control transfer. The exact sequence allows libc to provide an extension point for 'standard library' setup. Let's supply this ourselves. This involved quite a lot of trial and error, and you should not trust me in the slightest on this being stable. But the function is intended to end in a call to main with the standard argc/argv arguments, and never return. Since we control both sides, we can cheat a little and skip the arguments:

#![no_std]
#![no_main]
use core::panic::PanicInfo;

#[panic_handler]
fn panic(_info: &PanicInfo) -> ! {
    loop {}
}

#[no_mangle]
fn main() { }

#[no_mangle] extern "C" fn __libc_start_main() -> ! {
    main();
    loop {}
}
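
For reference, the arguments we skipped can be sketched roughly as follows. The parameter list mirrors how the Linux Standard Base documents __libc_start_main, to the best of my knowledge. The names start_main_sketch, demo_main and run_demo are hypothetical and exist only so that the shape can be exercised in an ordinary hosted program without clashing with the real symbol. Unlike the real function, this sketch returns main's result instead of exiting.

```rust
use core::ffi::{c_char, c_int, c_void};
use core::ptr;

/// The C-side shape of `main`: argc, argv and the environment pointer.
pub type CMain =
    extern "C" fn(c_int, *const *const c_char, *const *const c_char) -> c_int;
/// Optional init/fini style callbacks, unused in this sketch.
pub type Hook = Option<extern "C" fn()>;

/// Hypothetical stand-in with the documented parameter list.
/// Assumption: argv is followed by a NULL and then the environment array.
pub extern "C" fn start_main_sketch(
    main: CMain,
    argc: c_int,
    argv: *const *const c_char,
    _init: Hook,
    _fini: Hook,
    _rtld_fini: Hook,
    _stack_end: *mut c_void,
) -> c_int {
    // The environment starts right after the NULL that terminates argv.
    let envp = unsafe { argv.add(argc as usize + 1) };
    // The real function would hand this result to exit() and never return.
    main(argc, argv, envp)
}

/// Example `main` that just reports how many arguments it was given.
pub extern "C" fn demo_main(
    argc: c_int,
    _argv: *const *const c_char,
    _envp: *const *const c_char,
) -> c_int {
    argc
}

/// Builds a one-element argv (plus terminators) and runs the sketch.
pub fn run_demo() -> c_int {
    let name = b"demo\0";
    // argv[0], the argv terminator, and an empty environment's terminator.
    let argv = [name.as_ptr() as *const c_char, ptr::null(), ptr::null()];
    start_main_sketch(demo_main, 1, argv.as_ptr(), None, None, None, ptr::null_mut())
}
```

Our cheat above simply ignores all seven parameters, which works only because our own main never looks at them.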

Running a 'minimal' GNU libc target

The last code fragment from the prior section finally compiles and loops infinitely just fine. Let's break out strace and take a look under the hood: what is actually involved in setting up the C environment implied by our target?

execve("./target/release/no_std_hi", ["./target/release/no_std_hi"], 0x7ffd703dad40 /* 74 vars */) = 0
brk(NULL)                               = 0x64c4a81de000
arch_prctl(0x3001 /* ARCH_??? */, 0x7ffd9b4196d0) = -1 EINVAL (Invalid argument)
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x72d0a8aa1000
arch_prctl(ARCH_SET_FS, 0x72d0a8aa1a80) = 0
set_tid_address(0x72d0a8aa1d50)         = 101321
set_robust_list(0x72d0a8aa1d60, 24)     = 0
rseq(0x72d0a8aa23a0, 0x20, 0, 0x53053053) = 0
mprotect(0x64c4a6257000, 4096, PROT_READ) = 0

Woah. Scary stuff. Line-by-line:

execve is part of strace switching into the actual binary.
brk is used to discover the current program break, that is, the end of the data segment.
arch_prctl just tries to initialize Control-Flow Enforcement Technology, and fails. I can't recall asking rustc or libc to do this.
access is part of glibc's convoluted and massive system for dynamically overriding symbols that would have been dynamically linked. No one asked for this, either, and we shouldn't need to override a symbol.
mmap grabs some memory pages from the operating system to use for thread-local storage. We're not going to use it.
arch_prctl, set_tid_address, set_robust_list all inform the kernel how to interact with our thread-local memory. That is, where to put and read structures that differ for each thread. Note these all point into the memory that was just mapped.
rseq, restartable sequences, is a sadly niche feature where a Linux thread can run a section of instructions without having been interrupted by the scheduler (or rather, it re-enters at a different program address if the thread has been pre-empted while inside the block). The call registers one structure for the whole thread to interact with the kernel, i.e. it tells the kernel how to verify whether such a critical section is running. Note how glibc has taken the liberty of making this choice for us, and pointlessly so. We're not using rseq.
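mprotect marks a page of our own binary as read-only once the dynamic loader has finished patching it (RELRO, requested by the -z,relro flag in the linker line above). This is not a stack guard page, which would be mapped without any access rights at all.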

Running without libc

The next logical dependency to avoid after #![no_std] is clearly the link to libc. Not only is it costly to link, even if we consider dynamic linking allowable, but it is not fully transparent to rustc to do so. And as we've seen, even the initialization has costs. There's a target which does not have these costs: x86_64-unknown-none. Just drop this in .cargo/config.toml and install the target via rustup.

# Requires: rustup target install x86_64-unknown-none
[build]
target = "x86_64-unknown-none"

Helpfully, ELF records an entry point address in its header, and linkers fill it by default from a symbol we can define without having to involve linker scripts at all: _start. Since we're not initializing any platform control transfer, we now need to use #![no_main]. Note how main refers only to the control transfer part; we still have our 'platform's' semantics of an initial entrypoint. We just get control.

#![no_std]
#![no_main]

#[no_mangle]
pub extern "C" fn _start() -> ! {
    loop {}
}

use core::panic::PanicInfo;

#[panic_handler]
fn panic(_info: &PanicInfo) -> ! {
    loop {}
}

And that comes to: 640 bytes after a strip (which you can even configure in the Cargo profile).
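
As a sketch of that profile setting, assuming the release profile is the one being built, the Cargo.toml entry could look like this:

```toml
# Cargo.toml
[profile.release]
# Strip symbols and debug info from the final binary.
strip = true
```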

Doing a hello world

We've broken the chains that bind us and cast away the cords of libc linkage, so let's start doing useful stuff. First on the list is interacting with the operating system. For this we need to break out our assembly. The interaction with the kernel occurs by moving values into the right registers, yielding control with the right instruction, and resuming control with a different process state.

How exactly we do the assembly interaction is completely up to us here: an inline asm! macro, linking another object, or blindly re-interpreting a byte array of instructions as if it were a function symbol. (The latter is my favorite bit of magic, but pick your poison.) The Linux ABI is very similar to the System V C ABI, which affords us a great deal of leniency if we declare our function properly. Most interestingly, the system call identifier, a simple register-sized integer, is not equivalent to the first argument despite libc defining its signature this way. This means our Rust function should end up more efficient than C once we've convinced the compiler to inline the setup to our syscall primitive.

pub unsafe extern "C" fn call3(_: isize, _: isize, _: isize, NR: SysNr) -> isize {
    // Pseudocode: the body stands for three instructions in AT&T syntax (source, destination).
    // Move the fourth argument from the scratch register %rcx to %rax.
    "mov %rcx,%rax";
    // Other arguments already set up by the SysV ABI
    "syscall";
    // Return value delivered by the kernel in %rax already.
    "ret";
}
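
Of the options listed above, the inline asm! route is the easiest to show in full. What follows is a minimal sketch, not the code of the crate used further down: the names raw_syscall3, try_write and SYS_WRITE are hypothetical, and it assumes an x86-64 Linux target. It is written so that it also compiles in an ordinary hosted crate.

```rust
use core::arch::asm;

/// Call number for write on x86-64 Linux.
pub const SYS_WRITE: isize = 1;

/// Issues a system call with three arguments and returns the raw result.
/// Negative values in the range of errno codes signal an error.
///
/// Safety: the caller must uphold whatever the chosen system call demands
/// of its arguments, such as valid pointers and lengths.
pub unsafe fn raw_syscall3(nr: isize, a: isize, b: isize, c: isize) -> isize {
    let ret: isize;
    unsafe {
        asm!(
            "syscall",
            // The call number goes in through rax, the result comes back in rax.
            inlateout("rax") nr => ret,
            in("rdi") a,
            in("rsi") b,
            in("rdx") c,
            // The syscall instruction itself overwrites rcx and r11.
            lateout("rcx") _,
            lateout("r11") _,
            options(nostack)
        );
    }
    ret
}

/// Thin wrapper that casts everything to register-sized integers.
pub fn try_write(fd: isize, buf: &[u8]) -> isize {
    unsafe { raw_syscall3(SYS_WRITE, fd, buf.as_ptr() as isize, buf.len() as isize) }
}
```

Here the compiler picks the registers for us, so no mov is needed, at the price of spelling out the clobbered registers by hand.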

The system calls for 0 through 2 arguments are similar, with a different register for the call number argument in the System V ABI. From 4 arguments upwards we need to reshuffle a few registers, because the syscall instruction itself overwrites %rcx and the kernel therefore expects the fourth argument in %r10. Now we just need to find the magic system call numbers and cast all our arguments to register-sized values:

#![no_std]
#![no_main]

fn exit() -> ! {
    unsafe {
        syscall_linux_raw::syscall1(syscall_linux_raw::SysNr(60), 0);
        core::hint::unreachable_unchecked();
    }
}

fn write(fd: usize, buf: &[u8]) {
    unsafe {
        syscall_linux_raw::syscall3(syscall_linux_raw::SysNr(1), fd as isize, buf.as_ptr() as isize, buf.len() as isize);
    }
}

#[no_mangle]
pub extern "C" fn _start() -> ! {
    write(1, "Hello, world!\n".as_bytes());
    exit();
}

use core::panic::PanicInfo;

#[panic_handler]
fn panic(_info: &PanicInfo) -> ! {
    loop {}
}

Segmentation fault

One final push

WAT! But Rust works if it compiles, so what is going on? We forgot something that the platform does provide, after all. Inspecting the failure with gdb reveals we're actually segfaulting on the call instruction to syscall3. And we're segfaulting at address 0x0? It's as if the symbol is just missing. A call to objdump reveals the answer here, if we ask it to disassemble for us:

$ objdump -dx target/x86_64-unknown-none/release/examples/simple

0000000000001260 <_ZN6simple5write17ha62c0a9d479e57fcE>:
    1260:	48 8d 35 d9 ef ff ff 	lea    -0x1027(%rip),%rsi        # 240 <_ZN6simple4exit17h5340d43b981301c5E-0x1010>
    1267:	bf 01 00 00 00       	mov    $0x1,%edi
    126c:	ba 0e 00 00 00       	mov    $0xe,%edx
    1271:	b9 01 00 00 00       	mov    $0x1,%ecx
    1276:	ff 25 5c 11 00 00    	jmp    *0x115c(%rip)        # 23d8 <_DYNAMIC+0xe8>

The last line, # 23d8 <_DYNAMIC+0xe8>, is a marker inserted by objdump to inform us of a relocation. A relocation is an entry in the ELF headers which references an address that was not fully determined by the linker. Instead, the location of the relocation contains only an offset relative to something else (there are actually several formats for this offset), which is supposed to be replaced with the final symbol location at a later point. Here the jump goes through a memory slot that a loader would normally fill in. Nobody did, so the slot still holds 0 and we jump there. Relocations exist so that the file and memory layout can be determined for subsections of a binary file. Another use case is to make code position independent, that is, to make the assembled section independent of its base address in memory, which allows the section to be reshuffled and allocated at runtime. For instance, it may be necessary to accommodate other sections not known ahead of time, as in dynamically loaded code.
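
To make the missing step concrete, here is a rough sketch of what a loader does for the simplest kind of x86-64 relocation: write base address plus addend into the slot. The struct follows the layout of an ELF64 relocation entry with addend. The function and demo names are hypothetical, and the base address in the demo is an arbitrary assumption.

```rust
/// One entry of an ELF64 relocation table with explicit addends.
#[derive(Clone, Copy)]
pub struct Rela {
    /// Where in the loaded image the slot to patch lives.
    pub offset: u64,
    /// Symbol index in the high 32 bits, relocation type in the low 32 bits.
    pub info: u64,
    /// Constant that is added to the base address.
    pub addend: i64,
}

/// Relocation type meaning "base address plus addend" on x86-64.
pub const R_X86_64_RELATIVE: u64 = 8;

/// Patches every relative relocation for an image loaded at `base`.
/// Returns how many entries were applied; other types are skipped.
/// Panics if an offset points outside of `image`.
pub fn apply_relative(image: &mut [u8], base: u64, relas: &[Rela]) -> usize {
    let mut applied = 0;
    for rela in relas {
        if rela.info & 0xffff_ffff != R_X86_64_RELATIVE {
            continue;
        }
        let value = base.wrapping_add(rela.addend as u64);
        let at = rela.offset as usize;
        image[at..at + 8].copy_from_slice(&value.to_le_bytes());
        applied += 1;
    }
    applied
}

/// Demo: a 16 byte image with one slot at offset 8 that should end up
/// pointing at image offset 0x1260 once the image is loaded at `base`.
pub fn demo_relocate() -> u64 {
    let base = 0x5555_0000_0000u64; // arbitrary load address for the demo
    let mut image = [0u8; 16];
    let relas = [Rela { offset: 8, info: R_X86_64_RELATIVE, addend: 0x1260 }];
    apply_relative(&mut image, base, &relas);
    let mut slot = [0u8; 8];
    slot.copy_from_slice(&image[8..16]);
    u64::from_le_bytes(slot)
}
```

In our binary nothing plays the role of apply_relative, which is why the slot is still empty when the jump reads it.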

We don't want to fix ourselves up at runtime! We did static linking so that we would not have to deal with this problem, damn it. Well, let's tell rustc about this explicitly. For this we add one final entry to .cargo/config.toml.

[build]
target = "x86_64-unknown-none"
rustflags = ["-C", "relocation-model=static"]

And drum roll for 888 bytes of binary:

Hello, world!

execve("../target/x86_64-unknown-none/release/examples/simple", ["../target/x86_64-unknown-none/re"...], 0x7ffd369ba040 /* 74 vars */) = 0
write(1, "Hello, world!\n", 14Hello, world!
)         = 14
exit(0)                                 = ?

Hardmo.de