Start reading the textbook!

In Chapter 1: Skim through:

   1.4 (to understand "lwp" in proc_t member names)
   1.7 (keep in mind that we look at x86, not SPARC!
        For x86 details read the excellent blog posts by Gustavo Duarte:
        http://duartes.org/gustavo/blog/post/memory-translation-and-segmentation
        http://duartes.org/gustavo/blog/post/cpu-rings-privilege-and-protection
        http://duartes.org/gustavo/blog/post/anatomy-of-a-program-in-memory)
	
  Chapter 2:  As you read, keep looking at proc_t definition
              and try commands from 2.13 in "mdb -k" (you must be root to run it!)
	   
   Read carefully:
   2.1 - 2.5 (especially 2.4 and Fig. 2.3)
               
   The figures are the best part of the book,
   spend time understanding them and following
   the links/pointers in code or actual memory           
   (with mdb -k)

   2.8 (again, keep in mind we are on x86)

   You can optionally start reading the following:
   2.10 - 2.10.1 (the /proc filesystem, 
          	  keep looking at the code in 
		  http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/proc/prvnops.c and Fig. 2.10)

1. System calls as the centerpiece of a Unix kernel. 
 
All privileged operations in Unix are performed on behalf of user
processes by "system call" code located in the kernel.  The data that
this code operates on is also located in the kernel and can only be
directly accessed when the CPU is in "kernel mode". This ensures that
user processes get to use this code only as a "package deal", with the
up-front permission and sanity checks being a part of the
package. This mechanism is the basis of the OS stability and security.
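
To see one of these up-front checks in action, here is a minimal sketch
that asks the kernel to write to a file descriptor that is not open.
The helper name errno_of_bad_write is made up for this example; the
EBADF result is what POSIX specifies for write() on an invalid
descriptor.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: try a write() on an invalid descriptor and
 * report the errno value the kernel hands back (0 if it succeeded). */
int errno_of_bad_write(void)
{
    const char msg[] = "x";

    errno = 0;
    /* -1 is never a valid file descriptor, so the kernel's sanity
     * check rejects the call before any data is written. */
    if (write(-1, msg, 1) == -1)
        return errno;
    return 0;
}
```

The user process never gets to run the write code "a little bit": the
check is part of the package.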

2. Some Linux details:

User-level code accesses syscall code through the so-called "call gate" 
mechanism: it sets the number of the desired call in a register
(EAX on Linux/x86), sets arguments or pointers to arguments
in other registers (EBX, ECX, EDX, ... on Linux 32bits) and executes
the "int 0x80" instruction (older 32bit systems), or "syscall" or "sysenter"
instructions (newer and 64bit systems). Note that the system call function
is accessed only by its number, not by its address, which user-level
code cannot "jump" or "call" to (if it tries, a segfault occurs). 
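
A minimal sketch of "calling by number" from C, assuming Linux with
glibc: the library's generic syscall() function takes the call number
and does the register setup and trap instruction for you. The function
name getpid_by_number is made up for this example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Request getpid by its number (SYS_getpid) rather than through the
 * named getpid() wrapper. Assumes Linux and glibc's syscall(). */
long getpid_by_number(void)
{
    return syscall(SYS_getpid);
}
```

The result should match what the ordinary getpid() wrapper returns,
since both end up in the same kernel function.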

The "int 0x80" instruction simultaneously puts the CPU into kernel
mode ("ring 0") and transfers control to the address stored in slot 0x80
of the x86 CPU's Interrupt Descriptor Table (which is pointed to
by the CPU's special IDTR register). That address is *the single
entry point* for all system calls. 

Look at the nice Fig. 1 in this IBM developer article on syscalls:
http://www.ibm.com/developerworks/linux/library/l-system-calls/

Look at ENTRY(system_call) in an older Linux kernel: 
http://www.cs.dartmouth.edu/~sergey/cs108/rootkits/entry.S

Note the "call *sys_call_table(,%eax,4)" instruction. According to the Linux 
syscall calling convention, the system call number is passed in EAX, and 4
is the pointer length in 32 bit systems, so this just calls the implementation
of the system call through its function pointer in the EAX-th slot of the
sys_call_table, which is a table of function pointers.
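
The addressing mode in that instruction is just array indexing. Here is
a sketch in C that does the lookup both ways: with ordinary indexing,
and with the base + number * pointer-size arithmetic spelled out. The
table and handler names are toys invented for this example, not kernel
symbols.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef long (*toy_handler_t)(void);

static long toy_zero(void) { return 0; }
static long toy_one(void)  { return 1; }
static long toy_two(void)  { return 2; }

/* Toy stand-in for sys_call_table: a plain array of function pointers. */
static toy_handler_t toy_call_table[] = { toy_zero, toy_one, toy_two };
#define TOY_NCALLS (sizeof toy_call_table / sizeof toy_call_table[0])

/* Ordinary C indexing. */
long call_by_index(size_t nr)
{
    assert(nr < TOY_NCALLS);
    return toy_call_table[nr]();
}

/* The same lookup written the way the addressing mode computes it:
 * table base + nr * (size of one pointer), then an indirect call. */
long call_by_scaled_offset(size_t nr)
{
    const unsigned char *base = (const unsigned char *)toy_call_table;
    toy_handler_t fn;

    assert(nr < TOY_NCALLS);
    memcpy(&fn, base + nr * sizeof(toy_handler_t), sizeof fn);
    return fn();
}
```

On a 32 bit system sizeof(toy_handler_t) is 4, which is where the 4 in
the instruction comes from.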

In the above file, the common entry point for system calls is the 
ENTRY(system_call) at line 241. (The ENTRY macro creates a linkable symbol 
for the linker to pick up later; you can see this symbol in your /boot/System.map file,
which contains all public kernel symbols.) 

Note the saving of all userland registers by the SAVE_ALL macro, and
the switch to a fixed value __USER_DS in the data segment selectors
(%ds, %es).  Also note how the original values pushed to the stack are
restored in the RESTORE_* macros. (For now, ignore the code that gets
emitted into other sections (between ".section .fixup,"ax";" and
".previous"), as it is not meant to run on a normal system call path.)
Note that %ds and %es are restored, too.  Note the all-important IRET
instruction that returns control to userland, restoring (by popping
them from the stack) not only the EIP back to the place in userland
program right past the "int 0x80" that caused the system call to go
into the kernel, but also the CS and EFLAGS registers. Read about the
IRET instruction in the Intel manual.


3. OpenSolaris/Illumos 

Syscall numbers exposed in Solaris in:  /etc/name_to_sysnum

Syscall numbers defined in: /usr/src/uts/common/sys/syscall.h 
(http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/sys/syscall.h)

Syscalls dispatched in: /usr/src/uts/intel/ia32/os/syscall.c
http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/intel/ia32/os/syscall.c

Observe: dosyscall() gets the address of the requested syscall
         function by "code" in syscall_entry() then executes
	 it by function pointer (lines 896--898).


System call table: usr/src/uts/common/os/sysent.c
http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/os/sysent.c

Observe:    Line 439 and below,  struct sysent sysent[NSYSCALL] = ...
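
To get a feel for a table of entries (rather than bare function
pointers), here is a toy version. Everything in it (struct toy_sysent,
its fields, TOY_ENOSYS, the handlers) is invented for this sketch and
only loosely modeled on the idea of sysent[]; read the real definition
in the source linked above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_NSYSCALL 4
#define TOY_ENOSYS   (-1)   /* made-up error value for this sketch */

struct toy_sysent {
    int      narg;                       /* number of arguments expected */
    int64_t (*callc)(long a0, long a1);  /* handler, NULL if slot unused */
};

static int64_t toy_nop(long a0, long a1) { (void)a0; (void)a1; return 0; }
static int64_t toy_add(long a0, long a1) { return (int64_t)a0 + a1; }

/* Slots 1 and 3 are deliberately left empty (zero-initialized). */
static const struct toy_sysent toy_sysent_table[TOY_NSYSCALL] = {
    [0] = { 0, toy_nop },
    [2] = { 2, toy_add },
};

/* Look up the entry by call number and run it through its pointer. */
int64_t toy_dispatch(unsigned int code, long a0, long a1)
{
    const struct toy_sysent *ent;

    if (code >= TOY_NSYSCALL)
        return TOY_ENOSYS;          /* number out of range */
    ent = &toy_sysent_table[code];
    if (ent->callc == NULL)
        return TOY_ENOSYS;          /* unused slot */
    return ent->callc(a0, a1);
}
```

For example, toy_dispatch(2, 3, 4) runs toy_add and returns 7, while an
empty or out-of-range number never reaches any handler.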


4. A simple syscall: getpid()

http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/syscall/getpid.c#42

Looks up the PID via the pointer to the current thread descriptor curthread
(follows the pointer to the process structure of type proc_t, then
locates the integer PID value through that -- see my MDB session in 
proc_t-kernel-view.txt)
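
The pointer chase can be sketched with toy structs. The field names
below imitate kernel names (t_procp, p_pidp, pid_id) as an assumption
for illustration; the structs themselves are made up and far smaller
than the real ones, so check proc.h and thread.h for the actual layout.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for the kernel's pid, process and thread structures. */
struct toy_kpid    { int pid_id; };
struct toy_kproc   { struct toy_kpid *p_pidp; int p_ppid; };
struct toy_kthread { struct toy_kproc *t_procp; };

/* thread -> process -> pid record -> integer PID */
int toy_getpid(const struct toy_kthread *t)
{
    const struct toy_kproc *p = t->t_procp;

    return p->p_pidp->pid_id;
}
```

In MDB you follow the same chain by hand, printing one struct, copying
the pointer value, and printing the next.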

Kernel struct that keeps process data (along with some others,
explained briefly on pp. 44--48 of the textbook, details in Section 2.4):
http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/sys/proc.h#130

Suggestion: explore proc_t for the linking between process structs. How
many other proc_t's are linked to it and why? (Many...)
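
As a warm-up for that exploration, here is a toy process tree that uses
parent / first-child / next-sibling links. The struct is invented for
this sketch; the three link names imitate proc_t members, but the real
proc_t has more links than these, which is the point of the exercise.

```c
#include <assert.h>
#include <stddef.h>

struct toy_pnode {
    int               pid;
    struct toy_pnode *p_parent;   /* who created us */
    struct toy_pnode *p_child;    /* first child, or NULL */
    struct toy_pnode *p_sibling;  /* next child of the same parent */
};

/* Push a new child onto the front of the parent's child list. */
void toy_add_child(struct toy_pnode *parent, struct toy_pnode *child)
{
    child->p_parent = parent;
    child->p_sibling = parent->p_child;
    parent->p_child = child;
}

/* Walk the sibling chain to count a process's direct children. */
size_t toy_count_children(const struct toy_pnode *parent)
{
    size_t n = 0;
    const struct toy_pnode *c;

    for (c = parent->p_child; c != NULL; c = c->p_sibling)
        n++;
    return n;
}
```

A tool like ::pstree has to do a walk of this general kind to draw its
tree.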

5. Tracing a program's system calls.

"truss" on OpenSolaris, "strace" on Linux trace all the syscalls made
by an application.

Try:
truss echo Hello

The tracing starts from the execve call that loads the binary for
the echo command in the shell's path (/usr/gnu/bin/echo) and the
subsequent mmap calls that load the segments of code and data
from that binary. This binary is in the ELF format (cf. the output
of "file /usr/gnu/bin/echo").
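
What "file" checks first is the magic number: an ELF file starts with
the byte 0x7f followed by the letters E, L, F. A minimal sketch of that
check (the function name looks_like_elf is made up):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return 1 if the buffer starts with the 4-byte ELF magic, else 0. */
int looks_like_elf(const unsigned char *buf, size_t len)
{
    static const unsigned char magic[4] = { 0x7f, 'E', 'L', 'F' };

    return len >= sizeof magic && memcmp(buf, magic, sizeof magic) == 0;
}
```

Feed it the first few bytes of any binary you have read into memory; a
shell script starting with "#!" will fail the check.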

Before the code can run, all the necessary components of that
dynamically linked file must be loaded (mmap-ed), such as the dynamic
linker-loader itself (/usr/lib/ld.so.1), and the libc library
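(cf. "ldd /usr/gnu/bin/echo"). These dependencies are described
in the ELF file itself: the kernel's binary format handler reads the
path of the linker-loader from it and maps that in, and the
linker-loader then finds and maps the libraries the binary needs.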

Observe that all but the last five of these syscalls are for "setting
up" so that the write() syscall can finally do the echo's job.
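
Stripped of all the setup, the job itself is tiny. Here is a sketch of
the core of an echo-like program: build the line and hand it to the
kernel with one write(). The function toy_echo and its 256-byte limit
are made up for this example; the real echo is more careful.

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Write text followed by a newline to fd with a single write() call.
 * Returns the number of bytes written, or -1 on error. */
ssize_t toy_echo(int fd, const char *text)
{
    char line[256];
    int len = snprintf(line, sizeof line, "%s\n", text);

    if (len < 0 || (size_t)len >= sizeof line)
        return -1;              /* formatting failed or text too long */
    return write(fd, line, (size_t)len);
}
```

Calling toy_echo(1, "Hello") from a main() and running the result under
truss or strace lets you compare the trace against that of the real
echo.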

6. Using the Modular Debugger (MDB) to examine kernel state 

"mdb -k" launches the debugger and "attaches" it to the running 
kernel (just as "mdb -p <process id>" would attach it to a running
user process).

The debugger has a somewhat different command syntax from GDB.
Its  ::help  command is a useful entry point.

Here is a (larger) tutorial:  mdb-reference-chapter.pdf
If you are in a hurry, skip the historical intro and go straight to the examples of commands.

Tip: You can pipe debugger commands' output through grep when
there is too much of it, rather than dealing with the internal pager.
E.g.:  
      ::dcmds !grep module 
to catch all commands with names or descriptions that contain "module".

To see the address (in hex) of a symbol (function, variable, or any other
thing that the debugger knows the address of):

<symbol>=K  (or =X on 32bit machines) 

E.g.: 

getpid=K   -- the address of the getpid function as a 64bit hex number
getpid=X   -- the address of the getpid function as a 32bit hex number 
	       (on 64bit machine, this will just give you the lower part of
	        the address, and no warning that the address is incomplete)
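
The silent truncation is the same thing that happens when you squeeze a
64bit value into a 32bit one in C. A small sketch (the function name is
made up, and any address you try it on is just an example value):

```c
#include <assert.h>
#include <stdint.h>

/* Keep only the lower 32 bits of a 64-bit address, discarding the
 * upper half without any warning. */
uint32_t low_32_bits(uint64_t addr)
{
    return (uint32_t)(addr & 0xffffffffu);
}
```

So if a kernel address on a 64bit machine starts with a run of f's,
the 32bit view drops those and shows only the last eight hex digits.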

To see the contents (in hex) of memory at that address, use / rather than =

getpid/X   -- the opcodes at the start of the getpid function as one 32bit
	      hex number (little endian)

getpid/4B  -- the same, as 4 separate bytes, in order

getpid/4i  -- the opcodes disassembled into instructions (stops at first 4)

More about formats: ::formats  (e.g., "::formats !grep hex")

getpid::dis -- disassembles the whole function
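
Why do /X and /4B show the same bytes in a different order? On x86 a
32bit number is stored least significant byte first. A sketch that
assembles four bytes the way /X presents them (the function name is
made up, and the sample bytes in the note below are an arbitrary
example, not the actual start of getpid):

```c
#include <assert.h>
#include <stdint.h>

/* Combine four bytes, given in memory order, into the 32-bit
 * little-endian number they represent. */
uint32_t le32_from_bytes(const uint8_t b[4])
{
    return (uint32_t)b[0]
        | ((uint32_t)b[1] << 8)
        | ((uint32_t)b[2] << 16)
        | ((uint32_t)b[3] << 24);
}
```

For memory bytes 55 48 89 e5 (in that order) this gives e5894855, i.e.
the hex number reads "backwards" compared to the byte dump.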

====

Useful commands to look up:

::ps

::pstree

::objects

::print -t "struct proc"  -- print the definition of a data type 

::print -a "struct proc"  -- print the definition of a struct type 
	   	   	     with hex offsets of each member 

More info on this is in Chapter 2.4 ("Process Structures").
This is what ps and other process reporting utilities extract;
we are going to see how.
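
The member offsets that ::print -a shows are the same numbers the C
compiler computes with offsetof(). A sketch with a made-up struct
(struct toy_record is not a kernel type):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Invented struct, just to have some members to measure. */
struct toy_record {
    uint32_t flags;
    uint32_t state;
    uint64_t id;
    void    *next;
};

/* Print each member's offset in hex, one per line. */
void toy_print_offsets(void)
{
    printf("%zx flags\n", offsetof(struct toy_record, flags));
    printf("%zx state\n", offsetof(struct toy_record, state));
    printf("%zx id\n",    offsetof(struct toy_record, id));
    printf("%zx next\n",  offsetof(struct toy_record, next));
}
```

Knowing the offset of a member lets you find it in a raw memory dump:
take the address of the struct and add the offset.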

7. Reading kernel code

The OpenGrok code browsing system is at  http://src.illumos.org/source/ 

Kernel code "lives" under project "illumos-gate", 
under the path  /illumos-gate/usr/src/uts/   (note UTS, which stands
for Unix Time-Sharing, a legacy name)

Before we start reading kernel code in earnest, here are 
some idioms.

==== Functions defined in assembly ====

The ENTRY_* macros create function symbols that the linker will
treat as normal C functions (when C functions are compiled into
assembly, similar assembly is actually generated for them, too):

http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/intel/ia32/sys/asm_linkage.h#210

#define  ENTRY_NP(x) \
 	 .text; \			<--- place in .text (code) segment
	 .align	ASM_ENTRY_ALIGN; \      <--- align at 4 byte boundary 
 	 .globl	x; \		 	<--- make macro's arg a global symbol..
 	 .type	x, @function; \         <---  of type "function"
x:	 	   	      		<--- here it starts...  

==== Getting pointer to current thread ====

http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/intel/asm/thread.h -- thread pointer context:

extern __inline__ struct _kthread *threadp(void)
{
	void *__value;

#if defined(__amd64)
	__asm__ __volatile__(
	    "movq %%gs:0x18,%0"		/* CPU_THREAD */
	    : "=r" (__value));
#elif defined(__i386)
	__asm__ __volatile__(
	    "movl %%gs:0x10,%0"		/* CPU_THREAD */
	    : "=r" (__value));
#else
#error	"port me"
#endif
	return (__value);
}

For explanations of the __asm__ embedding of Assembly into gcc 
C code, see http://www.ibm.com/developerworks/library/l-ia.html, or
http://www.cs.virginia.edu/~clc5q/gcc-inline-asm.pdf  (local copy: gcc-inline-asm.pdf)
for more details.
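
If the constraint syntax ("=r", "r", %0, %1) looks cryptic, here is a
tiny userland sketch you can compile with gcc. It is not kernel code:
the function asm_add is made up, it only adds two numbers, and on
non-x86 machines it falls back to plain C.

```c
#include <assert.h>

/* Add b to a with one instruction written as gcc extended asm.
 * %0 is operand 'a' ("+r": a register that is both read and written),
 * %1 is operand 'b' ("r": any register, read only). "cc" tells the
 * compiler that the flags register is modified. */
static inline unsigned long asm_add(unsigned long a, unsigned long b)
{
#if defined(__x86_64__) || defined(__i386__)
    __asm__ __volatile__(
        "add %1, %0"
        : "+r" (a)
        : "r" (b)
        : "cc");
    return a;
#else
    return a + b;   /* fallback for other CPUs */
#endif
}
```

threadp() uses the same mechanism with a single output operand: the
compiler picks a register for __value and substitutes it for %0.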

    (Note: for functions that include assembly, the kernel 
           contains a "__lint" version of the code that does
      	   not actually get built but keeps the compiler
	   in checking ("lint") mode happy. For more info
	   see manpage of "lint").

    (Many macros in /illumos-gate/usr/src/uts/common/sys/thread.h
           are nice and readable; the key point is threadp(),
	   which is CPU-dependent.)

For explanations of "extern __inline__" see (**).

Here is how it is used:

http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/sys/thread.h#528

extern	kthread_t	*threadp(void);	      /* inline, returns thread pointer */
#define	curthread	(threadp())	      /* current thread pointer */
#define	curproc	        (ttoproc(curthread))  /* current process pointer */
#define	curproj		(ttoproj(curthread))  /* current project pointer */
#define	curzone		(curproc->p_zone)     /* current zone pointer */

cf: in getpid() code:

int64_t
getpid(void)
{
	rval_t	r;
 	proc_t	*p;
 
	p = ttoproc(curthread);    <--- curthread reads the current thread pointer
					through %gs (see threadp() above); the
					system call entry code makes sure %gs is
					set up for the kernel, so this is the
					thread, and hence the proc_t, on behalf
					of which the system call is made.
 	r.r_val1 = p->p_pid;
 	if (p->p_flag & SZONETOP)
 	   r.r_val2 = curproc->p_zone->zone_zsched->p_pid;
   	else
	   r.r_val2 = p->p_ppid;
	return (r.r_vals);
}
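
Note how getpid() fills two 32bit slots (r_val1, r_val2) and then
returns r_vals, a single 64bit value. A toy union showing that trick is
sketched below; toy_rval_t and both functions are invented for this
example and are not the kernel's actual rval_t definition.

```c
#include <assert.h>
#include <stdint.h>

/* Two 32-bit values and one 64-bit value sharing the same storage. */
typedef union toy_rval {
    struct {
        int32_t v1;
        int32_t v2;
    } r_v;
    int64_t r_vals;
} toy_rval_t;

/* Store two 32-bit results and hand them back as one 64-bit value. */
int64_t toy_pack_pair(int32_t first, int32_t second)
{
    toy_rval_t r;

    r.r_v.v1 = first;
    r.r_v.v2 = second;
    return r.r_vals;
}

/* Split the 64-bit value back into the two 32-bit results. */
void toy_unpack_pair(int64_t packed, int32_t *first, int32_t *second)
{
    toy_rval_t r;

    r.r_vals = packed;
    *first = r.r_v.v1;
    *second = r.r_v.v2;
}
```

This is how one system call can deliver both the PID and the parent PID
in a single return value.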

================================================================

(**) extern __inline__ explained:
http://publib.boulder.ibm.com/infocenter/compbgpl/v9v111/index.jsp?topic=/com.ibm.xlcpp9.bg.doc/language_ref/cplr243.htm --

"If you specify the __inline__ keyword, with the trailing underscores,
the compiler uses the GNU C semantics for inline functions. In
contrast to the C99 semantics, a function defined as __inline__
provides an external definition only; a function defined as static
__inline__ provides an inline definition with internal linkage (as in
C99); and a function defined as extern __inline__, when compiled with
optimization enabled, allows the co-existence of an inline and
external definition of the same function. For more information on the
GNU C implementation of inline functions, see the GCC documentation,
available at http://gcc.gnu.org/onlinedocs/." 

Why all this? See
http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/intel/ia32/ml/i86_subr.s#2402 
(a different definition in another file, and yet no linking problem).