Learning about program execution in Linux from the output of strace
I recently started reading Michael Kerrisk’s The Linux Programming Interface. So far, I’ve been enjoying it thoroughly. Inspired by it, I took a look at what strace
outputs for a very simple “hello world” program, with the hopes of understanding what’s in it.
Relatively early in the book, Kerrisk introduces the concept of system calls:
A system call is a controlled entry point into the kernel, allowing a process to request that the kernel perform some action on the process’s behalf. The kernel makes a range of services accessible to programs via the system call application programming interface (API). These services include, for example, creatig a new process, performing I/O, and creating a pipe for interprocess communication. (The syscalls(2) manual page lists the Linux system calls.)
The strace
program traces system calls and signals, and can attach to a running process or be specified a command to run. In this exercise, I’m going to write a simple “hello world” program in C and run strace
with the compiled output.
#include <stdio.h>
int main() {
printf("Hello world!");
return 0;
}
I compile it with GCC:
$ gcc -o hello.out ./hello.c
Now we can run it like so:
$ ./hello.out
Hello world!%
Let’s take a first look at the output of strace
without any special flags:
$ strace ./hello.out
execve("./hello.out", ["./hello.out"], 0x7fffcc590420 /* 113 vars */) = 0
brk(NULL) = 0x55f7df1ee000
arch_prctl(0x3001 /* ARCH_??? */, 0x7fff59ff3fe0) = -1 EINVAL (Invalid argument)
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
newfstatat(3, "", {st_mode=S_IFREG|0644, st_size=75074, ...}, AT_EMPTY_PATH) = 0
mmap(NULL, 75074, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fba4a414000
close(3) = 0
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\240\206\2\0\0\0\0\0"..., 832) = 832
pread64(3, "\6\0\0\0\4\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0"..., 784, 64) = 784
pread64(3, "\4\0\0\0 \0\0\0\5\0\0\0GNU\0\2\0\0\300\4\0\0\0\3\0\0\0\0\0\0\0"..., 48, 848) = 48
pread64(3, "\4\0\0\0\24\0\0\0\3\0\0\0GNU\0+H)\227\201T\214\233\304R\352\306\3379\220%"..., 68, 896) = 68
newfstatat(3, "", {st_mode=S_IFREG|0755, st_size=1983576, ...}, AT_EMPTY_PATH) = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fba4a412000
pread64(3, "\6\0\0\0\4\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0"..., 784, 64) = 784
mmap(NULL, 2012056, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fba4a226000
mmap(0x7fba4a24c000, 1486848, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x26000) = 0x7fba4a24c000
mmap(0x7fba4a3b7000, 311296, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x191000) = 0x7fba4a3b7000
mmap(0x7fba4a403000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1dc000) = 0x7fba4a403000
mmap(0x7fba4a409000, 33688, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fba4a409000
close(3) = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fba4a224000
arch_prctl(ARCH_SET_FS, 0x7fba4a413580) = 0
mprotect(0x7fba4a403000, 12288, PROT_READ) = 0
mprotect(0x55f7dd843000, 4096, PROT_READ) = 0
mprotect(0x7fba4a45f000, 8192, PROT_READ) = 0
munmap(0x7fba4a414000, 75074) = 0
newfstatat(1, "", {st_mode=S_IFCHR|0620, st_rdev=makedev(0x88, 0x1), ...}, AT_EMPTY_PATH) = 0
brk(NULL) = 0x55f7df1ee000
brk(0x55f7df20f000) = 0x55f7df20f000
write(1, "Hello world!", 12) = 12
exit_group(0) = ?
+++ exited with 0 +++
At first, I’m surprised that there are so many system calls for such a simple program. The system calls executed the most were mmap
(8 times) and pread64
(4 times).
Output from strace
(prior to the use of any flags) is in the format:
syscall(...arguments) = return_value
The first lines are about loading our program for execution, so let’s take a look at that!
Initializing program execution: execve
#
The first system call execve
executes our program, whose executable can be found at its first argument pathname
, and indicates some additional info:
# pathname arguments &env[] env[].length
1 execve("./hello.out", ["./hello.out"], 0x7fffcc590420 /* 113 vars */) = 0
In the arguments array, the first is simply the name of the program, and I didn’t pass any more arguments. By default, only a pointer to the environment variables array is printed, together with a number indicating how many environment variables are defined. The strace
flag -v
makes the output verbose by dereferencing pointers such as the one for env[]
, and -s
lets us set the maximum length of the string representation of these values.
$ strace -s 10000 -v ./hello.out
execve("./hello.out", ["./hello.out"], ["ENV_VAR_1=VALUE", "ENV_VAR_2=VALUE"]) = 0
...
So what exactly is meant by “executes our program”? The man
page for execve
states the following:
execve()
executes the program referred to by pathname. This causes the program that is currently being run by the calling process to be replaced with a new program, with newly initialized stack, heap, and (initial‐ ized and uninitialized) data segments.
This leaves us with the question: which currently-running program is execve
replacing? I called ./hello.out
from a shell. Presumably, if my program were to substitute the shell process, then I wouldn’t have a shell to come back to when my program ends, and the terminal would exit.
Creating a child process: fork
#
Programs in Linux are executed by means of a mechanism called fork-exec. The fork
system call creates a child process by duplicating the calling process.
This means when execve
gets called, it’s substituting the cloned instance of our shell in memory, so that the shell process doesn’t cease to exist. You can try opening a new terminal and running a program using exec
(as in exec ./hello.out
), and you’ll see your terminal closes afterwards!
At the end of execve
our process for hello.out
has received a newly initialized stack, heap and data segments as fresh new virtual address space. In memory, it looks a little bit like this:
To see what exactly the kernel is doing with the memory of your process, you can run pmap $PID
or examine the contents of the files in /proc/$PID/maps
, where $PID
is the process ID of the process you want to investigate.
At the end of execve
, a function called search_binary_handler
gets called. It checks the hello.out
for a matching magic number (a few bits at the start of the file) to find a handler. The binary handler that matches our program is binfmt_aout.c
, which handles .out
-style ELF files as produced by GCC. Then, binfmt_aout::load_aout_binary
ends up calling begin_new_exec
, marked as the “point of no return” in the Linux kernel source, after which any failure results in a fatal signal.
From then on, the kernel takes care of memory-mapping anything required by the executable. The code for this is architecture-dependent (see, for example, different implementations of activate_mm
in the kernel). Ultimately, the kernel configures the CPU context to kickstart execution of our process.
P.S. You may be wondering why fork
didn’t show up in the output of strace
. The answer is simply that strace
’s tracing begins after it has forked in order to execute the program passed in as an argument. Thanks, Kenster on StackOverflow.