0. Tracing.

Linux provides strace(1) and ltrace(1) to trace system calls
and library function calls respectively. 

Strace is built on the ptrace(2) system call and its family of
requests. Read the manual page ("man 2 ptrace") for their description. See

http://lxr.linux.no/#linux+v2.6.11/kernel/ptrace.c  to get
the idea of ptrace building blocks, esp.

http://lxr.linux.no/linux+*/kernel/ptrace.c#L120   (ptrace_attach)

and ptrace_readdata, ptrace_writedata.

Suggestion: strace several standard Linux commands such as ps and 
ls and interpret the results. Same for ltrace.

ltrace(1) uses a non-kernel -- but no less fundamental -- mechanism,
which is the basis of *dynamic linking*. We will discuss it in detail
at a later time.

  Linux   |  OpenSolaris
----------+--------------
  strace  |   truss
  ltrace  |   apptrace

1. Tracing ps 

ps on modern operating systems does little beyond reading /proc
and interpreting and pretty-printing its contents. The kernel
exposes its process control info in /proc's pseudo-files, and
it is the kernel's functions that walk the process control blocks
in response to your ps's "open" and "read" system calls.

The design that re-dispatches these general file-related system
calls to the appropriate worker functions is called VFS. Linux
uses a similar design; the main difference in terminology is that
Linux's in-kernel inodes play the role of Solaris' vnodes.

See Figure 1.5 on p. 31 and Table 1.1 on p. 32 for an overview
of VFS.

2. A /proc-based "ps" is a glorified "ls" + "cat"  

See what system calls ps makes with "truss ps".

The getdents64 function is what lists the contents of
a directory (this is what ls uses, too). The directory
read by ps is, of course, /proc. See "man getdents".

See procdents.d for the script that exposes the kernel functions
called in response to the ps command's getdents call.

Look at their code in the Solaris source browser.

We will read the revealed functions' code to see how they
walk the process list next time.

----

However, OpenSolaris provides a tracing framework that reaches much
deeper into the kernel: DTrace.

----

3. Exploring DTrace.

Unlike "truss <command>", which traces only the system calls made
by the process created by <command>,
DTrace probes fire for events caused by all processes on the system,
unless limited by predicates.

Hint: as a rule, there is some event that happens a lot more often than
others, and needs to be filtered out before the output becomes readable.

DTrace probes are ultimately function addresses within the kernel with
argument

Providers: 

syscall -- entry & return of all syscalls (argument knowledge)
proc    -- process creation and lifecycle; signal-related events
fbt     -- all kernel functions' boundaries
sdt     -- statically defined tracing: programmer-placed probes
sched   -- scheduler events that happen to threads 
io      -- I/O subsystem

pid     -- user-level functions in processes

vminfo  -- VM events (based on kstat)
sysinfo -- sys kstat kernel statistics

profile -- profiler, runs periodic "tick" actions (e.g., profile:::tick-5s)


D has its own built-in string type. Convert to it with stringof( char* )
for char* kernel strings. 

Argument strings located in userspace (such as syscall arguments)
must be copied into kernel space before they can be used,
since all probe actions are executed within kernel space.
Use "copyinstr" (see examples).

#pragma D option quiet 
  suppresses DTrace's own default print actions

#pragma D option flowindent 
  supplies indentation that follows entry and exit from functions

4. Examples.

Lots of examples can be found here:
http://www.brendangregg.com/DTrace/dtrace_oneliners.txt

Suggestion: work through these examples, using the DTrace Guide
(http://docs.sun.com/app/docs/doc/817-6223) as a reference.

http://www.sun.com/bigadmin/content/dtrace/ 
provides several DTrace tutorials, DTrace developer blogs 
and use cases.


NOTE: DTrace tries to provide its action blocks with variables and
      structs that are most convenient to work with. Which variables
      will be made accessible in the action block depends on the
      probes that match the  provider:module:probefunc:probename 
      expression. But:

      - if a variable is not defined for some matching probe, the block
      won't compile;

      - always check the DTrace Guide for the tables of variables
      available for specific providers and probes: this can save 
      a lot of time.
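
The provider:module:probefunc:probename expression from the NOTE can be
modeled in a few lines. The sketch below is only an approximation of
how such a description is read: fields given are assigned right to
left when fewer than four are present, an empty field matches anything,
and shell-style wildcards are handled with Python's fnmatch rather than
DTrace's own matcher. The function names and the "genunix" module in
the usage lines are assumptions made for illustration.

```python
from fnmatch import fnmatchcase

FIELDS = ("provider", "module", "function", "name")


def parse_probe_description(desc):
    """Split a probe description into its four fields.

    When fewer than four fields are given, they are taken right to
    left, so "getpid:entry" sets only function and name.
    """
    parts = desc.split(":")
    if len(parts) > len(FIELDS):
        raise ValueError("too many fields in probe description: " + desc)
    parts = [""] * (len(FIELDS) - len(parts)) + parts
    return dict(zip(FIELDS, parts))


def matches(desc, probe):
    """Check a description against a (provider, module, function, name) tuple.

    An empty field in the description matches any value; other fields
    are compared as shell-style patterns (an approximation).
    """
    pattern = parse_probe_description(desc)
    return all(
        pattern[field] == "" or fnmatchcase(value, pattern[field])
        for field, value in zip(FIELDS, probe)
    )


# "genunix" is a hypothetical module name used only for this example.
print(matches("fbt::getpid:entry", ("fbt", "genunix", "getpid", "entry")))
print(matches("fbt::getpid:entry", ("fbt", "genunix", "getpid", "return")))
```

One description can therefore match many probes, which is why the
variables available in an action block depend on every probe matched.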

Observe the  curpsinfo  variable pointing to a special "struct psinfo_t"
filled with info about the "current" process (i.e., the process
that caused the probe to fire),
as described in "Table 25–1 proc Probes" at 
http://docs.sun.com/app/docs/doc/817-6223/chp-proc?a=view

Observe that this struct's  pr_psargs  member contains the string
of arguments to  ls  after the Bash shell expanded them.

---

5. Using the Modular Debugger (MDB) to examine kernel state 

"mdb -k" launches the debugger and "attaches" it to the running
kernel (just as "mdb -p <process id>" would attach it to a running
user process).

The debugger has a somewhat different command syntax from GDB.
Its  ::help  command is a useful entry point.

Here is a (larger) tutorial:
http://learningsolaris.com/docs/chpt_mdb_os.pdf

Tip: You can pipe debugger commands' output through grep when
there is too much of it, rather than dealing with the internal pager.
E.g.:  
      ::dcmds !grep module 
to catch all commands with names or descriptions that contain "module".

To see the address (in hex) of a symbol (function, variable, or any other
thing that the debugger knows the address of):

<symbol>=X 

E.g.: 

getpid=X   -- the address of the getpid function as a 32-bit hex number

To see the contents (in hex) of memory at that address, use / rather than =

getpid/X   -- the opcodes at the start of the getpid function as one 32-bit
	      hex number (little endian)

getpid/4B  -- the same, as 4 separate bytes, in order

getpid/4i  -- the opcodes disassembled into instructions (stops at first 4)
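
The difference between the /X and /4B views is only byte order. The
sketch below shows this on four made-up bytes (they are NOT taken from
a real getpid and the helper names are invented for illustration): read
as one little-endian 32-bit number, the first byte in memory becomes
the lowest-order, rightmost pair of hex digits.

```python
def as_le_word(raw):
    """Interpret the first four bytes as one little-endian 32-bit number."""
    return int.from_bytes(raw[:4], "little")


def as_byte_list(raw, count=4):
    """Return the first `count` bytes as hex strings, in memory order."""
    return [format(b, "02x") for b in raw[:count]]


# Made-up sample bytes, in the order they would sit in memory.
sample = bytes([0x55, 0x8B, 0xEC, 0x83])

print(format(as_le_word(sample), "x"))   # prints 83ec8b55
print(" ".join(as_byte_list(sample)))    # prints 55 8b ec 83
```

So when reading a /X dump as opcodes, read each 32-bit word from right
to left, or use /B to see the bytes in memory order.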

More about formats: ::formats  (e.g., "::formats !grep hex")

getpid::dis -- disassembles the whole function

====

Useful commands to look up:

::objects

::ps

::print -t "struct proc"  -- print the definition of a data type 

e1ddb380::print proc_t    -- print the contents of memory at e1ddb380,
			     interpreting it as a proc_t struct type

(I got the above address from the ::ps  "walker" command that knows
 how to find and walk the process control blocks)

e1ddb380::print	 proc_t  p_ppid  -- print only the selected part of the 
		 	 	    data structure at that address
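
What ::print does, conceptually, is overlay a type definition on raw
bytes and then name the pieces. Here is a rough sketch of that idea
using Python's ctypes. ToyProc is a made-up two-field struct, not the
real proc_t layout (only the member name p_ppid is borrowed from the
example above), and the bytes are invented.

```python
import ctypes


class ToyProc(ctypes.LittleEndianStructure):
    """A toy stand-in for a process structure; NOT the real proc_t."""
    _fields_ = [
        ("toy_field", ctypes.c_int32),   # hypothetical member
        ("p_ppid", ctypes.c_int32),      # name borrowed from the text
    ]


# Eight invented bytes, as they might be read from some address.
raw = bytes.fromhex("2a000000" "01000000")

# Interpret the raw bytes as a ToyProc, like "<addr>::print ToyProc".
proc = ToyProc.from_buffer_copy(raw)

print(proc.toy_field)   # prints 42
print(proc.p_ppid)      # prints 1, like "<addr>::print ToyProc p_ppid"
```

The debugger does the same with the kernel's real type information,
which is why it needs to know the struct definitions.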

(remember to use !grep liberally if you don't like scrolling)

More info on this is in Chapter 2.4 ("Process Structures").
This is what ps and other process reporting utilities extract;
we are going to see how.

6. DTrace internals

Mdb can reveal some details of DTrace's implementation
(see dtrace-internals-x86.pdf for more).

Install a probe into the running kernel:

# dtrace -n fbt::getpid:entry

Without quitting the trace, examine it in memory:

# mdb -k 
> getpid::dis

Quit the trace, then repeat the same "mdb -k" examination of the
same function. Comparing the two listings, you will notice the "int 3"
(opcode 0xCC) at the preamble while the probe is installed. DTrace
installs its own INT 3 handler in the IDT, just like
a typical debugger, except that debuggers work in "ring 3", whereas
DTrace's fbt provider works fully within the kernel (except for
passing the output to the user-level dtrace(1) utility to print).
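
The patching itself can be pictured with a toy model. This is not
DTrace's code, just a sketch of the save / overwrite / restore idea on
a byte array standing in for a function's instructions. The class name,
method names and the sample bytes are all invented for illustration.

```python
INT3 = 0xCC  # the one-byte "int 3" opcode mentioned above


class PatchedFunction:
    """Toy model of a function whose first byte can be swapped for INT3."""

    def __init__(self, code):
        self.code = bytearray(code)   # stands in for kernel text
        self.saved = None             # original first byte while patched

    def enable(self):
        """Install the probe: remember the first byte, write INT3 over it."""
        if self.saved is None:
            self.saved = self.code[0]
            self.code[0] = INT3

    def disable(self):
        """Remove the probe: put the original first byte back."""
        if self.saved is not None:
            self.code[0] = self.saved
            self.saved = None


func = PatchedFunction(bytes([0x55, 0x8B, 0xEC, 0x83]))  # made-up bytes
func.enable()
print(func.code.hex())    # prints cc8bec83
func.disable()
print(func.code.hex())    # prints 558bec83
```

The two printed lines correspond to the two disassemblies: one taken
while the trace runs, one taken after it has been quit.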

--TBC--