Linux Syscall

Burt Rosenberg
Fall semester, 2010

System calls on the Intel architecture involve a special trap instruction. This instruction modifies both the flow of control and the privilege state of the processor. After the trap, the processor is running in kernel mode, with access to the entire physical memory, with return information saved on a special stack reserved for kernel operations, and executing a special trap-handling function.

Setting the target of this trap is a very deep operation, performed by the operating system during boot. The processor has a register which must be loaded with the base address of a table. The table is a sequence of entries, one for each interrupt or exception (including the trap), each giving the address of the function that handles it. This table is called the interrupt vector; Intel's own name for it is the Interrupt Descriptor Table (IDT). Both the table and the processor register are initialized during boot, and together they determine the one and only entry point into the kernel for the purpose of system services. This trap, together with the function that receives it, is called the syscall.

Setting up the kernel stack is more complicated. When a trap occurs, it is important that the kernel have a private and trusted stack on which to handle the trap. The kernel cannot continue to run on the user's stack, for many reasons. There is a choice here: either have a kernel stack associated with each process, or have a single kernel stack used universally. I believe Linux chooses to have a stack per process, and this means that the trap configuration for the processor must be modified with each context switch.

The syscall code is found in entry.S (for the i386 architecture). Search for the use of the symbol nr_syscalls to find an example function. So that this one function can serve the entire API of kernel services, a differentiator is passed as a required argument to the syscall. In Linux, as in many other operating systems, it is an integer: each service of the API is assigned a number. The syscall function then looks in a table, the syscall_table (spelled sys_call_table in the kernel source), at the entry indexed by this number, and finds there the address of the function implementing the system call.

More on traps

Sorry about this digression. I really want you to know this, but this isn't exactly the right moment. So skip over this section if you want.

Oh good, you didn't skip over.

A trap is one of a family of trap-like operations or events. The names and taxonomy vary from processor to processor, but they all break down according to the same broad principles. Intel defines Interrupts and Exceptions.

The non-programmed exceptions are, one way or another, errors. They are classified as Faults, Traps, or Aborts.

So I'm lying. The syscall is not a trap, it's a programmed exception. But that doesn't sound nearly as cool. Now you wish you had skipped over.

System services

The syscall and the syscall table are the fundamental mechanism for getting system services. Bookending this mechanism are a layer of code on the user side, which presents the syscall in a user-friendly way, and, on the other side, a kernel function which implements the service.

For example, the syscall nice is number 34, and is entry number 34 in the syscall table. The call itself is implemented somewhere in the kernel: nice is implemented, under the name sys_nice, in sched.c. See also setpriority in kernel/sys.c and unistd.h. On the user side, this 34 comes from /usr/include/asm/unistd_32.h.

Use man 2 nice to get information on this function. Section 2 of the Unix manpages covers system calls. See also getpriority and setpriority.

There must be a convention for the passing of arguments and return values. Argument passing is signaled by the asmlinkage keyword; the compiler takes care of the rest. The return convention is that a return of 0 (or, for calls that hand back a result, a non-negative value) means the syscall completed successfully, while a negative value is an error code.

The other bookend concerns the user code. This is not part of the kernel, and is therefore not part of Linux. Rather, this would be GNU, and in particular the C library, libc. The system call numbers are defined in the common and public header file unistd.h. There is a defined macro to make the system call, which is in fact an int 0x80 instruction.

Code References

This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
Last modified: 3 September 2010