Elements of system call implementation and OpenSolaris/Illumos process model (Chapter 2).

Reading (also referred to throughout this text):

Pace yourself, don't plan to do this in one sitting!

In Chapter 1:
  Skim through:
    1.4 (to understand "lwp" in proc_t member names)
    1.7 (keep in mind that we look at x86, not SPARC! For x86 details read the excellent blog posts by Gustavo Duarte:
         http://duartes.org/gustavo/blog/post/memory-translation-and-segmentation
         http://duartes.org/gustavo/blog/post/cpu-rings-privilege-and-protection
         http://duartes.org/gustavo/blog/post/anatomy-of-a-program-in-memory)

Chapter 2:
  As you read, keep looking at the proc_t definition and try the commands from 2.13 in "mdb -k" (must be root).
  Read carefully:
    2.1 - 2.5 (especially 2.4 and Fig. 2.3)
      The figures are the best part of the book; spend time understanding them and following the links/pointers in code or actual memory (with mdb -k).
    2.8 (again, keep in mind we are on x86)
  You can optionally start reading the following:
    2.10 - 2.10.1 (the /proc filesystem; keep looking at the code in
      http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/proc/prvnops.c and Fig. 2.10)

1. System calls as the centerpiece of a Unix kernel.

All privileged operations in Unix are performed on behalf of user processes by "system call" code located in the kernel. The data that this code operates on is also located in the kernel and can only be directly accessed when the CPU is in "kernel mode". This ensures that user processes get to use this code only as a "package deal", with the up-front permission and sanity checks being a part of the package. This mechanism is the basis of the OS's stability and security.

2. Some Linux details:

User-level code accesses syscall code through a "gate" mechanism: it sets the number of the desired call in a register (EAX on Linux/x86), sets arguments or pointers to arguments in other registers (EBX, ECX, EDX, ...
on 32-bit Linux) and executes the "int 0x80" instruction (older 32-bit systems), or the "syscall" or "sysenter" instructions (newer and 64-bit systems).

Note that the system call function is accessed only by its number, not by its address, which user-level code cannot "jump" or "call" to (if it tries, a segfault occurs).

The "int 0x80" instruction simultaneously puts the CPU into kernel mode ("ring 0") and transfers control to the address stored in the 0x80-th slot of the x86 CPU's Interrupt Descriptor Table (which is pointed to by the CPU's special IDTR register). That address is *the single entry point* for all system calls.

Look at the nice Fig. 1 in this IBM developer article on syscalls, which also has more details on Linux system calls:
  http://www.ibm.com/developerworks/linux/library/l-system-calls/

Look at ENTRY(system_call) in an older Linux kernel:
  http://lxr.linux.no/linux+v2.6.24/arch/x86/kernel/entry_32.S

Observe the sys_call_table on an older Linux kernel:
  http://lxr.linux.no/linux+v2.6.24/arch/x86/kernel/syscall_table_32.S

3. Some OpenSolaris details:

Syscall numbers exposed in Solaris in: /etc/name_to_sysnum

Syscall numbers defined in: /usr/src/uts/common/sys/syscall.h
  (http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/sys/syscall.h)

Syscalls dispatched in: /usr/src/uts/intel/ia32/os/syscall.c
  http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/intel/ia32/os/syscall.c
  Observe: dosyscall() gets the address of the requested syscall function by "code" in syscall_entry(), then executes it by function pointer (lines 896--898).

System call table: usr/src/uts/common/os/sysent.c
  http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/os/sysent.c
  Observe: Line 439 and below, struct sysent sysent[NSYSCALL] = ...

4.
A simple syscall: getpid()

  http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/syscall/getpid.c#42

Looks up the PID via curthread, the pointer to the current thread descriptor: it follows the pointer to the process structure of type proc_t, then reads the integer PID value from that structure (see my MDB session in proc_t-kernel-view.txt).

Kernel struct that keeps process data (along with some others, explained briefly on pp. 44--48 of the textbook, details in Section 2.4):
  http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/sys/proc.h#130

Suggestion: explore proc_t for the linking between process structs. How many other proc_t's are linked to it and why? (Many...)

[Similar structure for Linux is task_struct: http://lxr.linux.no/linux+v2.6.24/include/linux/sched.h#L917 ]

5. Tracing a program's system calls.

"truss" on OpenSolaris and "strace" on Linux trace all the syscalls made by an application. Try:

  truss echo Hello

The tracing starts from the execve call that loads the binary for the echo command found in the shell's path (/usr/gnu/bin/echo) and the subsequent mmap calls that load the segments of code and data from that binary. This binary is in the ELF format (cf. the output of "file /usr/gnu/bin/echo").

Before the code can run, all the necessary components of that dynamically linked file must be loaded (mmap-ed), such as the dynamic linker-loader itself (/usr/lib/ld.so.1) and the libc library (cf. "ldd /usr/gnu/bin/echo"). These dependencies are described in the ELF file format, and are parsed out of it by the kernel's binary format handler.

Observe that all but the last five of these syscalls are for "setting up" so that the write() syscall can finally do echo's job. Understanding how processes are set up requires some knowledge of the ELF format.

6.
DTrace

The textbook makes use of the DTrace tracing tool (a separate facility from truss, which works through /proc):
  http://wiki.illumos.org/display/illumos/DTrace

DTrace lets you observe a very large number of events happening in the kernel, by placing "probes" throughout the kernel code and printing out and aggregating the information they produce when execution reaches them.

See dtrace-notes.txt or DTrace-User-Guide.pdf if you have questions about examples in Chapter 2.

7. Using the Modular Debugger (MDB) to examine kernel state

"mdb -k" launches the debugger and "attaches" it to the running kernel (just as "mdb -p PID" would attach it to a running user process). The debugger has a somewhat different command syntax from GDB. Its ::help command is a useful entry point.

Here is a (larger) tutorial: http://learningsolaris.com/docs/chpt_mdb_os.pdf (local copy: mdb-reference-chapter.pdf)
If you are in a hurry, skip the historical intro and go straight to the examples of commands.

Tip: You can pipe debugger commands' output through grep when there is too much of it, rather than dealing with the internal pager. E.g.:

  ::dcmds !grep module

to catch all commands with names or descriptions that contain "module".
To see the address (in hex) of a symbol (function, variable, or any other thing that the debugger knows the address of), append =K to the symbol name (or =X on 32-bit machines). E.g.:

  getpid=K    -- the address of the getpid function as a 64-bit hex number
  getpid=X    -- the address of the getpid function as a 32-bit hex number
                 (on a 64-bit machine, this will just give you the lower part of the address, with no warning that the address is incomplete)

To see the contents (in hex) of memory at that address, use / rather than =

  getpid/X    -- the opcodes at the start of the getpid function as one 32-bit hex number (little endian)
  getpid/4B   -- the same, as 4 separate bytes, in order
  getpid/4i   -- the opcodes disassembled into instructions (stops after the first 4)

More about formats: ::formats (e.g., "::formats !grep hex")

  getpid::dis -- disassembles the whole function

====

Useful commands to look up:

  ::ps
  ::pstree
  ::objects
  ::print -t "struct proc"  -- print the definition of a data type
  ::print -a "struct proc"  -- print the definition of a struct type with hex offsets of each member

See also my MDB session looking at proc_t's (proc_t-kernel-view.txt).

More info on this is in Section 2.4 ("Process Structures"). This is what ps and other process reporting utilities extract; we are going to see how.

8. Tracing ps

ps on modern operating systems does little beyond reading /proc and interpreting and pretty-printing its contents. The kernel exposes its process control info in /proc's pseudo-files, and it is the kernel's functions that walk the process control blocks in response to your ps's "open" and "read" system calls.

The design that re-dispatches these general file-related system calls to the appropriate worker functions is called VFS. Linux uses a similar design (Linux's in-kernel inodes play the same role as Solaris' vnodes). See Figure 1.5 on p. 31 and Table 1.1 on p. 32 for an overview of VFS.

Try "strace ps" on Linux and "truss ps" on Illumos.

9. Doing the work of /proc

See what system calls ps makes with "truss ps".
The getdents64 function is what lists the contents of a directory (this is what ls uses, too). The directory read by ps is, of course, /proc. See "man getdents".

See d/procdents.d for a DTrace script that exposes the kernel functions called in response to a ps command's getdents call. Look at their code in the Solaris source browser. Next time we will read the revealed functions' code to see how they walk the process list.

10. Reading kernel code

The OpenGrok code browsing system is at http://src.illumos.org/source/

Kernel code "lives" under project "illumos-gate", under the path /illumos-gate/usr/src/uts/ (note UTS, which stands for Unix Time-Sharing, a legacy name).

Before we start reading kernel code in earnest, here are some idioms.

==== Functions defined in assembly ====

The ENTRY_* macros create function symbols that the linker will treat as normal C functions (when C functions are compiled into assembly, similar assembly is actually generated for them, too):

  http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/intel/ia32/sys/asm_linkage.h#210

    #define ENTRY_NP(x) \
        .text; \                      <--- place in .text (code) segment
        .align ASM_ENTRY_ALIGN; \     <--- align at an ASM_ENTRY_ALIGN-byte boundary
        .globl x; \                   <--- make macro's arg a global symbol..
        .type x, @function; \         <--- of type "function"
    x:                                <--- here it starts...
==== Getting pointer to current thread ====

  http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/intel/asm/thread.h -- thread pointer context:

    extern __inline__ struct _kthread *threadp(void)
    {
            void *__value;

    #if defined(__amd64)
            __asm__ __volatile__(
                "movq %%gs:0x18,%0"         /* CPU_THREAD */
                : "=r" (__value));
    #elif defined(__i386)
            __asm__ __volatile__(
                "movl %%gs:0x10,%0"         /* CPU_THREAD */
                : "=r" (__value));
    #else
    #error "port me"
    #endif
            return (__value);
    }

For explanations of the __asm__ embedding of assembly into gcc C code, see http://www.ibm.com/developerworks/library/l-ia.html, or http://www.cs.virginia.edu/~clc5q/gcc-inline-asm.pdf (local copy: gcc-inline-asm.pdf) for more details.

(Note: for functions that include assembly, the kernel contains a "__lint" version of the code that does not actually get built but keeps the compiler happy in checking ("lint") mode. For more info see the manpage of "lint".)

(Many macros in /illumos-gate/usr/src/uts/common/sys/thread.h are nice and readable; the key point is threadp(), which is CPU-dependent.)

For explanations of "extern __inline__" see (**). Here is how it is used:

  http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/sys/thread.h#528

    extern kthread_t *threadp(void);        /* inline, returns thread pointer */
    #define curthread       (threadp())             /* current thread pointer */
    #define curproc         (ttoproc(curthread))    /* current process pointer */
    #define curproj         (ttoproj(curthread))    /* current project pointer */
    #define curzone         (curproc->p_zone)       /* current zone pointer */

cf. the getpid() code:

    int64_t
    getpid(void)
    {
            rval_t r;
            proc_t *p;

            p = ttoproc(curthread);   <--- reads the current thread pointer off %gs (the CPU_THREAD slot above);
                                           the system call entry code makes sure the %gs segment selector is right
                                           for the process on behalf of which the system call is made,
                                           i.e., that this leads to the right proc_t.
            r.r_val1 = p->p_pid;
            if (p->p_flag & SZONETOP)
                    r.r_val2 = curproc->p_zone->zone_zsched->p_pid;
            else
                    r.r_val2 = p->p_ppid;
            return (r.r_vals);
    }

================================================================

(**) extern __inline__ explained:

  http://publib.boulder.ibm.com/infocenter/compbgpl/v9v111/index.jsp?topic=/com.ibm.xlcpp9.bg.doc/language_ref/cplr243.htm --

"If you specify the __inline__ keyword, with the trailing underscores, the compiler uses the GNU C semantics for inline functions. In contrast to the C99 semantics, a function defined as __inline__ provides an external definition only; a function defined as static __inline__ provides an inline definition with internal linkage (as in C99); and a function defined as extern __inline__, when compiled with optimization enabled, allows the co-existence of an inline and external definition of the same function. For more information on the GNU C implementation of inline functions, see the GCC documentation, available at http://gcc.gnu.org/onlinedocs/."

Why all this? See http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/intel/ia32/ml/i86_subr.s#2402 -- a different definition in another file, and yet no linking problem.