We looked some more at the boot-time setup of the *kernel context*. This will be useful when we look into user contexts and context switching.

x86 "rings" and the segment registers that implement the abstraction of "rings" are explained in "x86 OpenSolaris internals", Section 4.1. You may skip 4.1.3 -- or else look up the info about this mostly ignored hardware context-switching primitive in Intel's manuals. In modern OSes, TSS segments are used only to store kernel stack %esp values, or not at all.

BTW: "dboot" is "direct boot", loadable from GRUB, the multi-OS boot environment that has become a free software standard. For a detailed discussion, see the thread at
http://markmail.org/message/pjdp4lo5rckzefbr

To wit:

"To get from dboot into the rest of unix, dboot turns on virtual memory and jumps to the kernel entry point which is always just the first instruction at the load address of the kernel text. That's always i86pc/ml/locore.s - which starts with:

        /*
         * The very first thing in the kernel's text segment must be a jump
         * to the os/fakebop.c startup code.
         */
        .text
        jmp     _start

You'll find _start written in C in usr/src/uts/i86pc/fakebop.c. From there on out it's mostly written in C and not too hard to follow."

http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/i86pc/dboot/dboot_grub.s -> code_start

code_start -> startup_kernel in
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/i86pc/dboot/dboot_startkern.c#1069

Observe the kernel querying the CPU for various supported features via the CPUID instruction wrapped in macros, and then setting up the kernel context page tables accordingly.
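To make the CPUID step concrete, here is a minimal sketch of decoding the paging-related feature bits once the EDX values of CPUID leaves 1 and 0x80000001 are in hand. The bit positions are the ones documented in Intel's manuals; the macro, struct, and function names are hypothetical and are not the macros dboot actually uses. Executing CPUID itself needs inline assembly or a compiler intrinsic, which is left out here so the decoding logic stays portable.

```c
#include <stdbool.h>
#include <stdint.h>

/* Feature bits returned in EDX by CPUID with EAX = 1. */
#define CPUID_EDX_PSE     (1u << 3)   /* 4 MB pages in classic paging */
#define CPUID_EDX_PAE     (1u << 6)   /* Physical Address Extension */
#define CPUID_EDX_PGE     (1u << 13)  /* global pages */

/* Feature bits returned in EDX by CPUID with EAX = 0x80000001. */
#define CPUID_EXT_EDX_NX  (1u << 20)  /* per-page no-execute bit */
#define CPUID_EXT_EDX_LM  (1u << 29)  /* long mode (x86-64) */

/* Hypothetical summary of what a boot loader cares about for paging. */
struct paging_features {
    bool large_pages;
    bool pae;
    bool global_pages;
    bool nx;
    bool long_mode;
};

/* Decode the two EDX values into the summary struct. */
struct paging_features
decode_paging_features(uint32_t std_edx, uint32_t ext_edx)
{
    struct paging_features f;

    f.large_pages  = (std_edx & CPUID_EDX_PSE) != 0;
    f.pae          = (std_edx & CPUID_EDX_PAE) != 0;
    f.global_pages = (std_edx & CPUID_EDX_PGE) != 0;
    f.nx           = (ext_edx & CPUID_EXT_EDX_NX) != 0;
    f.long_mode    = (ext_edx & CPUID_EXT_EDX_LM) != 0;
    return f;
}
```

A boot loader would then pick the page table layout from such a summary, e.g. use the PAE layout only if f.pae is set, and set NX bits in entries only if f.nx is set.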
Actual kernel page table setup in build_page_tables() is something of an anti-climax:
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/i86pc/dboot/dboot_startkern.c#997

However, chase down the actual machine page (ma) to virtual page translation, and observe various special cases:
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/i86pc/dboot/dboot_startkern.c#501

Of such things are actual OSes made.

------------------------------------------------------------------------

Side Note: The pragmatics of OS programming are in delicate balance with "extra" hardware features that keep getting added to the platform. This "dance" is ultimately about how humans manage complexity: some hardware features that seemed like a good idea go unused; others get used far beyond expectations. "Tried and familiar" considerations and backward compatibility requirements factor in, too.

The historical transformation of the role played by x86 segmentation and the more recent examples of x86-64 extensions and VM features like PAE are good food for thought here:

  x86-64: "x86 Solaris internals", 2.1, 2.3, 3.2
  PAE: (see below)

------------------------------------------------------------------------

Chapter 8: Virtual Memory generics.

  * Isolation (illusion of -- cf. debugging support)
  * Page sharing between apps
  * Demand paging
  * VM as file I/O cache

Suggestion: track page-ins of an mmap-ed file using the io::: and vminfo::: DTrace providers, and demonstrate demand paging and file caching. Probe names and descriptions actually go a long way towards explaining the OpenSolaris VM system.

A taste of actual hardware:

PAE: http://en.wikipedia.org/wiki/Physical_Address_Extension
(36 bits vs classical 32 bits of physical address space, i.e.
4GB -> 64GB)

Classic page translation without PAE:

  CR3 -> 4 KB "page directory" (4 byte entry)*1024
      -------> 4 KB "page table" (4 byte entry)*1024

Page translation with PAE (bit 5 of CR4 := 1):

  CR3 -> Page Dir Ptr Table (8 byte entry)*4
      -------> 4 KB "page directory" (8 byte entry)*512
          ----> 4 KB "page table" (8 byte entry)*512

Bit 7 in each PDE eliminates the last lookup stage when set; instead, the rest of the address is interpreted as an offset into a 4MB page (no PAE, 22 bits) or a 2MB page (with PAE, 21 bits).

Bit 0 in the PTE is the crucial "Page Present" bit. When hardware translation sees it clear, it raises the #PF trap. The handler swaps the page back in, and the faulting instruction is then retried. The faulting address is stored in %cr2 on entry to the #PF handler.

BTW, the per-page NX bit is in the PAE layout.

See "x86 OpenSolaris internals", Section 4.3.

For the OS-developer level of documentation on PTEs and PDEs in detail: "Intel 64 and IA-32 Architectures Software Developer's Manual Volume 3A: System Programming Guide", pp. 3-35 -- 3-45
http://download.intel.com/design/processor/manuals/253668.pdf
See also Section 3.12, p. 3-51 for a brief summary of TLBs.

------------------------------------------------------------------------

Unified OpenSolaris trap (faults and exceptions) handling: "x86 OpenSolaris internals", Section 6.2.

Once the interrupt-specific stubs described on p.115 and then the further unifying _cmntrap() are done, we are back in C-land:
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/i86pc/os/trap.c

All traps are handled uniformly by trap(). This is a conscious design decision: all registers are saved on the stack by the respective ASM interrupt handler pointed to from the IDT, and then C routines are presented with the same data structure.

Observe the "struct regs *rp" argument in trap(), and also "caddr_t addr" (must be extracted from CR2 when the pagefault handler is called).
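To connect the fault address that trap() receives with the two translation layouts above, here is a minimal sketch of how a 32-bit linear address (such as the value found in CR2) splits into table indices, and how bit 0 of an entry decides whether translation faults. The bit widths follow the layouts described in these notes; all struct and function names are hypothetical, and this is plain index arithmetic, not code from the OpenSolaris sources.

```c
#include <stdbool.h>
#include <stdint.h>

#define PTE_PRESENT    (1ull << 0)  /* bit 0: "Page Present" */
#define PDE_PAGE_SIZE  (1ull << 7)  /* bit 7: PDE maps a large page */

/* Hypothetical result type: indices into each table level plus offset. */
struct va_split {
    unsigned pdpt;    /* Page Dir Ptr Table index (PAE only, else 0) */
    unsigned pd;      /* page directory index */
    unsigned pt;      /* page table index */
    uint32_t offset;  /* offset within a 4 KB page */
};

/* Classic paging: 10 + 10 + 12 bits (1024 entries of 4 bytes per table). */
struct va_split
split_classic(uint32_t va)
{
    struct va_split s;

    s.pdpt = 0;
    s.pd = (va >> 22) & 0x3ff;
    s.pt = (va >> 12) & 0x3ff;
    s.offset = va & 0xfff;
    return s;
}

/* PAE paging: 2 + 9 + 9 + 12 bits (512 entries of 8 bytes per table). */
struct va_split
split_pae(uint32_t va)
{
    struct va_split s;

    s.pdpt = (va >> 30) & 0x3;
    s.pd = (va >> 21) & 0x1ff;
    s.pt = (va >> 12) & 0x1ff;
    s.offset = va & 0xfff;
    return s;
}

/* With bit 7 set in the PDE, the rest of the address is a page offset. */
uint32_t
large_page_offset(uint32_t va, bool pae)
{
    return va & (pae ? 0x1fffffu : 0x3fffffu);  /* 21 or 22 bits */
}

/* Translation raises #PF when the present bit of an entry is clear. */
bool
entry_not_present(uint64_t entry)
{
    return (entry & PTE_PRESENT) == 0;
}
```

For example, the address 0xC0801234 lands in page directory slot 0x302 under classic paging, but in slot 4 of the fourth (index 3) page directory under PAE.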
The CR2 read is done by cmntrap ("common trap"),
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/i86pc/ml/locore.s
(which, in turn, starts with pushing all regs on the stack by INTR_PUSH:
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/intel/ia32/sys/privregs.h#155)

Observe:

  type = rp->r_trapno; /* passed on stack by some, in regs by others */

Observe the use of pagefault() from
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/i86pc/vm/vm_machdep.c

TO BE CONTINUED...

------------------------------------------------------------------------

Chapter 9.2 explains address spaces. Read it carefully.

Suggestion: Using DTrace's vminfo::: provider, observe all four kinds of page mappings in Fig. 9.2 (described on p. 457) in action for your favorite process. E.g., write simple "memory hog" programs to malloc() a lot of anonymous pages, or call functions with lots of stack-allocated local arrays. Observe file sharing between processes. Notice how *minor faults* are handled (see 9.4.4 for definitions).

Chapter 9.4 explains how address spaces are implemented and handled.

  proc_t.p_as -> struct as -> (AVL Tree) -> "struct seg"

Cf.: http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/vm/seg.h

  as_alloc() [once, at system boot/init time] -> as_dup() [by fork()]

See Table 9.3 for address space manipulation functions.

------------------------------------------------------------------------

Some extras:

http://blogs.sun.com/JoeBonasera/entry/more_on_solaris_x86_and

http://en.wikipedia.org/wiki/Translation_lookaside_buffer

Observe the use of the INVLPG instruction in
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/intel/ia32/ml/i86_subr.s#426

------------------------------------------------------------------------
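As a study aid for the Chapter 9.4 chain (address space -> ordered set of segments), here is a minimal sketch of finding the segment that contains a given address, which is the first thing a fault handler has to do with the address it is given. Assumptions: a sorted array with binary search stands in for the kernel's AVL tree, and the toy_seg type and toy_seg_lookup() function are hypothetical names, not the definitions in seg.h.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified segment: a [base, base + size) range. */
struct toy_seg {
    uintptr_t   base;
    size_t      size;
    const char *name;
};

/*
 * Find the segment containing addr, or NULL if addr is unmapped.
 * segs[] must be sorted by base and must not overlap; the binary
 * search plays the role of the ordered tree lookup in a real kernel.
 */
const struct toy_seg *
toy_seg_lookup(const struct toy_seg *segs, size_t n, uintptr_t addr)
{
    size_t lo = 0, hi = n;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (addr < segs[mid].base)
            hi = mid;                   /* addr lies below this segment */
        else if (addr - segs[mid].base >= segs[mid].size)
            lo = mid + 1;               /* addr lies above this segment */
        else
            return &segs[mid];          /* base <= addr < base + size */
    }
    return NULL;
}
```

An address that falls in a gap between segments returns NULL, which corresponds to the case where a fault on that address cannot be resolved by any segment.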