Start reading the textbook!

In Chapter 1:

  Skim through:
    1.4 (to understand "lwp" in proc_t member names)
    1.7 (keep in mind that we look at x86, not SPARC! For x86 details read the excellent blog posts by Gustavo Duarte:
         http://duartes.org/gustavo/blog/post/memory-translation-and-segmentation
         http://duartes.org/gustavo/blog/post/cpu-rings-privilege-and-protection
         http://duartes.org/gustavo/blog/post/anatomy-of-a-program-in-memory)

Chapter 2:

  As you read, keep looking at the proc_t definition and try the commands from 2.13 in "mdb -k" (you must be root to run it!)

  Read carefully:
    2.1 - 2.5 (especially 2.4 and Fig. 2.3). The figures are the best part of the book; spend time understanding them and following the links/pointers in code or actual memory (with mdb -k).
    2.8 (again, keep in mind we are on x86)

  You can optionally start reading the following:
    2.10 - 2.10.1 (the /proc filesystem; keep looking at the code in http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/proc/prvnops.c and Fig. 2.10)

1. System calls as the centerpiece of a Unix kernel.

All privileged operations in Unix are performed on behalf of user processes by "system call" code located in the kernel. The data that this code operates on is also located in the kernel and can only be directly accessed when the CPU is in "kernel mode". This ensures that user processes get to use this code only as a "package deal", with the up-front permission and sanity checks being a part of the package. This mechanism is the basis of the OS's stability and security.

2. Some Linux details:

User-level code accesses syscall code through the so-called "call gate" mechanism: it sets the number of the desired call in a register (EAX on Linux/x86), sets arguments or pointers to arguments in other registers (EBX, ECX, EDX, ... on 32-bit Linux) and executes the "int 0x80" instruction (older 32-bit systems), or the "syscall" or "sysenter" instructions (newer and 64-bit systems).
Note that the system call function is accessed only by its number, not by its address, which user-level code cannot "jump" or "call" to (if it tries, a segfault occurs).

The "int 0x80" instruction simultaneously puts the CPU into kernel mode ("ring 0") and transfers control to the address stored in the 0x80-th slot of the x86 CPU's Interrupt Descriptor Table (which is pointed to by the CPU's special IDTR register). That address is *the single entry point* for all system calls.

Look at the nice Fig. 1 in this IBM developer article on syscalls:
  http://www.ibm.com/developerworks/linux/library/l-system-calls/

Look at ENTRY(system_call) in an older Linux kernel:
  http://www.cs.dartmouth.edu/~sergey/cs108/rootkits/entry.S

Note the "call *sys_call_table(,%eax,4)" instruction. According to the Linux syscall calling convention, the system call number is passed in EAX, and 4 is the pointer length on 32-bit systems, so this just calls the implementation of the system call through its function pointer in the EAX-th slot of sys_call_table, which is a table of function pointers.

In the above file, the common entry point for system calls is the ENTRY(system_call) at line 241. (The ENTRY macro creates a linkable symbol for the linker to pick up later; you can see this symbol in your /boot/System.map file, which contains all public kernel symbols.)

Note the saving of all userland registers by the SAVE_ALL macro, and the switch to a fixed value __USER_DS in the data segment selectors (%ds, %es). Also note how the original values pushed to the stack are restored in the RESTORE_* macros. (For now, ignore the code that gets emitted into other sections (between ".section .fixup,"ax";" and ".previous"), as it is not meant to run on a normal system call path.) Note that %ds and %es are restored, too.
Note the all-important IRET instruction that returns control to userland, restoring (by popping them from the stack) not only the EIP, back to the place in the userland program right past the "int 0x80" that caused the system call to go into the kernel, but also the CS and EFLAGS registers. Read about the IRET instruction in the Intel manual.

3. OpenSolaris/Illumos

Syscall numbers exposed in Solaris in:
  /etc/name_to_sysnum

Syscall numbers defined in:
  /usr/src/uts/common/sys/syscall.h
  (http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/sys/syscall.h)

Syscalls dispatched in:
  /usr/src/uts/intel/ia32/os/syscall.c
  http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/intel/ia32/os/syscall.c

  Observe: dosyscall() gets the address of the requested syscall function by "code" in syscall_entry(), then executes it by function pointer (lines 896--898).

System call table:
  usr/src/uts/common/os/sysent.c
  http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/os/sysent.c

  Observe: Line 439 and below, struct sysent sysent[NSYSCALL] = ...

4. A simple syscall: getpid()

  http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/syscall/getpid.c#42

It looks up the PID via the pointer to the current thread descriptor, curthread (it follows the pointer to the process structure of type proc_t, then locates the integer PID value through that -- see my MDB session in proc_t-kernel-view.txt).

Kernel struct that keeps process data (along with some others, explained briefly on pp. 44--48 of the textbook, details in Section 2.4):
  http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/sys/proc.h#130

Suggestion: explore proc_t for the linking between process structs. How many other proc_t's are linked to it and why? (Many...)

5. Tracing a program's system calls.

"truss" on OpenSolaris and "strace" on Linux trace all the syscalls made by an application.
Try:
  truss echo Hello

The tracing starts from the execve call that loads the binary for the echo command in the shell's path (/usr/gnu/bin/echo) and the subsequent mmap calls that load the segments of code and data from that binary. This binary is in the ELF format (cf. the output of "file /usr/gnu/bin/echo"). Before the code can run, all the necessary components of that dynamically linked file must be loaded (mmap-ed), such as the dynamic linker-loader itself (/usr/lib/ld.so.1) and the libc library (cf. "ldd /usr/gnu/bin/echo"). These dependencies are described in the ELF file format: the kernel's binary format handler parses the name of the linker-loader out of the file, and the linker-loader in turn resolves the library dependencies.

Observe that all but the last five of these syscalls are for "setting up" so that the write() syscall can finally do the echo's job.

6. Using the Modular Debugger (MDB) to examine kernel state

"mdb -k" launches the debugger and "attaches" it to the running kernel (just as "mdb -p <pid>" would attach it to a running user process). The debugger has a somewhat different command syntax from GDB. Its ::help command is a useful entry point.

Here is a (larger) tutorial: mdb-reference-chapter.pdf
If you are in a hurry, skip the historical intro and go straight to the examples of commands.

Tip: You can pipe debugger commands' output through grep when there is too much of it, rather than dealing with the internal pager. E.g.:
  ::dcmds !grep module
to catch all commands with names or descriptions that contain "module".
To see the address (in hex) of a symbol (function, variable, or any other thing that the debugger knows the address of):
  <symbol>=K   (or <symbol>=X on 32-bit machines)

E.g.:
  getpid=K  -- the address of the getpid function as a 64-bit hex number
  getpid=X  -- the address of the getpid function as a 32-bit hex number
               (on a 64-bit machine, this will just give you the lower part of the address, with no warning that the address is incomplete)

To see the contents (in hex) of memory at that address, use / rather than =
  getpid/X  -- the opcodes at the start of the getpid function as one 32-bit hex number (little endian)
  getpid/4B -- the same, as 4 separate bytes, in order
  getpid/4i -- the opcodes disassembled into instructions (stops after the first 4)

More about formats:
  ::formats (e.g., "::formats !grep hex")

  getpid::dis -- disassembles the whole function

====

Useful commands to look up:
  ::ps
  ::pstree
  ::objects
  ::print -t "struct proc" -- print the definition of a data type
  ::print -a "struct proc" -- print the definition of a struct type with hex offsets of each member

More info on this is in Chapter 2.4 ("Process Structures"). This is what ps and other process reporting utilities extract; we are going to see how.

7. Reading kernel code

The OpenGrok code browsing system is at http://src.illumos.org/source/

Kernel code "lives" under project "illumos-gate", under the path /illumos-gate/usr/src/uts/ (note UTS, which stands for Unix Time-Sharing, a very old legacy name).

Before we start reading kernel code in earnest, here are some idioms.
==== Functions defined in assembly ====

The ENTRY_* macros create function symbols that the linker will treat as normal C functions (when C functions are compiled into assembly, similar assembly is actually generated for them, too):

  http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/intel/ia32/sys/asm_linkage.h#210

  #define ENTRY_NP(x) \
          .text; \                    <--- place in .text (code) segment
          .align ASM_ENTRY_ALIGN; \   <--- align at 4 byte boundary
          .globl x; \                 <--- make macro's arg a global symbol..
          .type x, @function; \       <--- of type "function"
  x:                                  <--- here it starts...

==== Getting pointer to current thread ====

  http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/intel/asm/thread.h -- thread pointer context:

  extern __inline__ struct _kthread *threadp(void)
  {
          void *__value;

  #if defined(__amd64)
          __asm__ __volatile__(
              "movq %%gs:0x18,%0"             /* CPU_THREAD */
              : "=r" (__value));
  #elif defined(__i386)
          __asm__ __volatile__(
              "movl %%gs:0x10,%0"             /* CPU_THREAD */
              : "=r" (__value));
  #else
  #error "port me"
  #endif
          return (__value);
  }

For explanations of the __asm__ embedding of assembly into gcc C code, see http://www.ibm.com/developerworks/library/l-ia.html, or http://www.cs.virginia.edu/~clc5q/gcc-inline-asm.pdf (local copy: gcc-inline-asm.pdf) for more details.

(Note: for functions that include assembly, the kernel contains a "__lint" version of the code that does not actually get built but keeps the compiler in checking ("lint") mode happy. For more info see the manpage of "lint".)

(Many macros in /illumos-gate/usr/src/uts/common/sys/thread.h are nice and readable; the key point is threadp(), which is CPU-dependent.)

For explanations of "extern __inline__" see (**).
Here is how it is used:

  http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/sys/thread.h#528

  extern kthread_t *threadp(void);        /* inline, returns thread pointer */
  #define curthread (threadp())           /* current thread pointer */
  #define curproc   (ttoproc(curthread))  /* current process pointer */
  #define curproj   (ttoproj(curthread))  /* current project pointer */
  #define curzone   (curproc->p_zone)     /* current zone pointer */

cf. the getpid() code:

  int64_t
  getpid(void)
  {
          rval_t r;
          proc_t *p;

          p = ttoproc(curthread);    <--- will access local thread storage off %gs;
                                          the system call path makes sure the %gs segment
                                          selector is right for the process on behalf of
                                          which the system call is made, i.e., leads to
                                          the right proc_t.
          r.r_val1 = p->p_pid;
          if (p->p_flag & SZONETOP)
                  r.r_val2 = curproc->p_zone->zone_zsched->p_pid;
          else
                  r.r_val2 = p->p_ppid;
          return (r.r_vals);
  }

================================================================

(**) extern __inline__ explained:

  http://publib.boulder.ibm.com/infocenter/compbgpl/v9v111/index.jsp?topic=/com.ibm.xlcpp9.bg.doc/language_ref/cplr243.htm --

  "If you specify the __inline__ keyword, with the trailing underscores, the compiler uses the GNU C semantics for inline functions. In contrast to the C99 semantics, a function defined as __inline__ provides an external definition only; a function defined as static __inline__ provides an inline definition with internal linkage (as in C99); and a function defined as extern __inline__, when compiled with optimization enabled, allows the co-existence of an inline and external definition of the same function. For more information on the GNU C implementation of inline functions, see the GCC documentation, available at http://gcc.gnu.org/onlinedocs/."

(Why all this? See http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/intel/ia32/ml/i86_subr.s#2402 -- a different definition in another file, and yet no linking problem.)