This is not an exhaustive walkthrough of the lock implementation logic on OpenSolaris; expect a series of starting points for your own code reading, not full enlightenment (this time).

In a pre-emptable kernel, spinlocks are the only choice for drivers and other critical operations that cannot block. See chapter 17 for explanations. For everything else there are adaptive locks, which choose to either block or spin, based on a simple heuristic: if the holder of the lock is running on another CPU, it won't hold the lock for long, and it makes sense to spin; if the holder is blocked, then blocking is the only option, because we don't know how long the owner will remain blocked.

Spinlocks on multiprocessor machines ultimately depend on cache coherency logic to be efficient. Since locking the memory bus on every check of a state-holding memory byte would likely be too wasteful, and the results of non-locked reads cannot be relied on to persist even until the next instruction in the current thread, implementors of locks must deal with the case when the lock is snatched away just after its state was read and found "free" (or convince themselves that it can never happen on a particular platform). Keep this in mind when reading the code.

A few details in preparation.
Getting the CPU a thread is running on (recall that %gs is used to hold "per-thread context" data):

http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/intel/ia32/ml/i86_subr.s

    ENTRY(curcpup)
            movl    %gs:CPU_SELF, %eax
            ret

    #define CPU (curcpup()) /* Pointer to current CPU */

See also the definition of cpu_t : http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/sys/cpuvar.h

Clearing interrupts: http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/intel/ia32/ml/i86_subr.s#intr_clear

    ENTRY(intr_clear)
    ENTRY(clear_int_flag)
            pushfl
            popl    %eax
            CLI(%edx)       /* just cli */
            ret

http://linux.derkeiler.com/Newsgroups/comp.os.linux.development.system/2004-03/0349.html -- about the cli instruction

When restoring interrupts, "sti" (the opposite of "cli") may not be what we want -- instead, we typically want to restore the previous state of the interrupt flag, whether cleared or set (think stacks of saved state rather than one bit of state). Observe this logic in the "splr" function, http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/i86pc/os/intr.c#1142

In case you have trouble locating the definition of return_instr(): http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/intel/ia32/ml/i86_subr.s#4097

    ENTRY_NP(return_instr)
            rep;    ret     /* use 2 byte instruction when branch target */
                            /* AMD Software Optimization Guide - Section 6.2 */

In short, it's about branch prediction: http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/25112.PDF

----

Spin lock implementation: lock_set_spl(). The extra logic as compared to Linux spin locks is due to awareness of the interrupts, and emulation of missed software interrupts (see splr()).
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/intel/ia32/ml/lock_prim.s#273

The i386 legacy XCHG instruction _implies_ bus locking: http://www.itis.mn.it/linux/quarta/x86/xchg.htm

Read further down the mutex_enter_vector() function, to see where the spinning vs blocking on a queue ("turnstile", in OpenSolaris terms) decision is made, and how provisions are made for the actual state just checked changing just before the next step of setting up the lock. Cf. 17.5.2 in the book.

We will be looking some more into interrupt privilege levels and their treatment by different CPUs.

===================================================================================

OpenSolaris Boot Process.

Sun's x86 ASM reference: http://docs.sun.com/app/docs/doc/802-1948/

To understand the x86 boot process, make sure you read the "Solaris Internals on x86" about the GDT, IDT (p 112--, also explained below with "mdb -k" examples) and related segments and gates. Then start reading the code at dboot_grub.s at code_start: (you will have to frequently refer to intel/sys/segments.h for definitions of data structures and macros that format and fill their entries, e.g., SEL_GDT).

In the segments.h#84 notice the __xpv #ifdefs. The assignment of x86 code privilege levels (CPL) is different with and without the hypervisor present. The hypervisor must receive the CPL of 0, whereas the kernel code will receive CPL of 1 and user code CPL of 3. Otherwise, the kernel should have the CPL of 0. The SEL_KPL constant sets the kernel CPL level and later is used to form the segment descriptors.

Note that CPL, DPL and RPL in their respective registers and structures together constitute the "Rings" 0--3 . The "rings" are an abstraction, implemented through these bit fields and memory segmentation logic.

Line 136: Intel processors after Pentium diverged in features they support (on a Linux machine, do "cat /proc/cpuinfo" and grep for "flags").
These features are configured through "Model-Specific Registers" (MSRs) with RDMSR and WRMSR instructions. (For some fascinating history, see http://www.x86.org/articles/p5msr/pentiummsrs.htm)

Line 148: page translation via page tables is set up here. top_page_table is already prepared at this point by startup_kernel() in dboot_startkern.c , called from #110. Check where top_page_table gets created and filled: http://src.opensolaris.org/source/search?refs=top_page_table&project=%2Fonnv Ignore Xen's __xpv ifdefs, as before, and look for "else" bare-metal code.

Line 164: This is where the paging mechanism really gets activated, by setting the CR0_PG bit in the %cr0 register.

Line 189: Actual switch to 64 bit mode, for 64-bit extension platforms.

TO BE CONTINUED... (see also "What happens next" below)

============== IDT and GDT in an x86_64 bit system ================

The Global Descriptor Table (GDT) is referred to on every address translation (segment lookup phase, before virtual address translation via %cr3 and page tables). The Interrupt Descriptor Table (IDT) is involved in the CPU dispatching every trap, interrupt, and system call.

These tables *must be filled* before the processor can switch to "protected mode", i.e., become a modern 32/64 bit CPU with virtual memory support and all the other nice features, from a legacy slow dumb 16-bit processor that used to run DOS (and still can), which is how it awakens on power-up and starts into the boot process. I am kind of like that myself in the mornings, except I can't run DOS.

The boot setup goes something like this:

    _start in dboot_grub.s ->
    _locore_start in locore.s ->
    mlsetup() in mlsetup.c ->
    init_desctbls()

After mlsetup(), _locore_start calls main(), which never returns, and begins machine-independent system initialization (http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/os/main.c#101). See comment in uts/common/os/main.c at line 101.
There is also a Local Descriptor Table, for per-process use, but it is not widely used. Somewhere, in a lonely mountain cave high above the world, there is a hermit who knows and understands all of Intel's hardware features, but the way there is long, hard, and only opens itself to the truly enlightened :-)

These tables contain not bare addresses, but *descriptors* (which contain an address -- for some reason broken into separate pieces, lower and higher -- and then some extra information, such as permission bits).

Note that the segment registers CS, DS, ES, FS, GS, and SS that "invisibly" participate in each instruction fetch (CS) and data access (DS for ordinary MOVs, SS for stack operations like PUSH and POP, ES for the destination of string operations like MOVS** and STOS) contain not bare addresses where the respective segments start, but 16-bit *selectors*, which consist of a 13-bit index into the GDT (or LDT) and 2 tiny bit fields: a 1-bit table indicator (GDT or LDT) and a 2-bit requested privilege level (RPL).

These tables, stored in RAM but most likely cached by the processor, are found through two special registers, GDTR and IDTR. See "Solaris Internals on x86" page 112 for explanations of the contents of the GDT and IDT.

NB: Solaris defines pointer variables gdt0 and idt0 that point to GDT and IDT: http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/intel/ia32/os/desctbls.c#99

Here they are (gdt0 and idt0 are laid out next to each other)

    > gdt0/4K
    gdt0:
    gdt0:           fffffffffb7fe000 fffffffffb7fd000 0 0
                    ^^^ ptr to GDT   ^^^ ptr to IDT

These variables contain the same addresses as GDTR and IDTR point to. I found no way of reading these registers directly from "mdb -k"; however, since these addresses get set on boot and do not change, "gdt0" and "idt0" are sufficient.

Observe also the code for the functions that read and write GDTR and IDTR, rd_gdtr, wr_gdtr, rd_idtr, wr_idtr (just ::dis at these symbols, results shown below). Observe them being used by init_gdt and init_idt functions in desctbls.c , themselves called by init_desctbls().
In case you are wondering what all these #ifdefs for "xpv" are, "i86xpv" is the name of the x86 platform modified to run on Xen, apparently for "Xen para-virtualization"; "i86pc" is bare metal or emulated x86.

Observe the data structures reflecting the format of these table entries and related registers in http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/intel/sys/segments.h In particular: user_desc_t and gate_desc_t

To wit, the first entry in the GDT (note that gdt0::print, as typed here, formats the bytes of the pointer variable gdt0 itself, which is why the fields spell out fffffffffb7fe000; dereference with *gdt0::print user_desc_t to see the actual first entry):

    > ::print -t user_desc_t
    {
        unsigned long usd_lolimit :16
        unsigned long usd_lobase :16
        unsigned long usd_midbase :8
        unsigned long usd_type :5
        unsigned long usd_dpl :2
        unsigned long usd_p :1
        unsigned long usd_hilimit :4
        unsigned long usd_avl :1
        unsigned long usd_long :1
        unsigned long usd_def32 :1
        unsigned long usd_gran :1
        unsigned long usd_hibase :8
    }
    > gdt0::print user_desc_t
    {
        usd_lolimit = 0xe000
        usd_lobase = 0xfb7f
        usd_midbase = 0xff
        usd_type = 0x1f
        usd_dpl = 0x3
        usd_p = 0x1
        usd_hilimit = 0xf
        usd_avl = 0x1
        usd_long = 0x1
        usd_def32 = 0x1
        usd_gran = 0x1
        usd_hibase = 0xff
    }
    > ::sizeof user_desc_t
    sizeof (user_desc_t) = 8
    > ::sizeof gate_desc_t
    sizeof (gate_desc_t) = 0x10
    > *idt0/16K
    0xfffffffffb7fd000: fb848e780030b7e0 fefffffbffffffff
                        fb848e580030b7f0 fdff7f77ffffffff
                        fb848ee80030b8d0 7fffeebfffffffff
                        fb84ee200030baa0 7fbfefffffffffff
                        fb84eeb80030bac0 fcfffeffffffffff
                        fb848e680030bad0 ffffffefffffffff
                        fb848ef80030bae0 7dbfffffffffffff
                        fb848ec80030be80 cfccf7ffffffffff
    > *idt0
    0xfffffffffb7fd000: fb848e780030b7e0 fefffffbffffffff
                        fb848e580030b7f0 fdff7f77ffffffff
                        fb848ee80030b8d0 7fffeebfffffffff
                        fb84ee200030baa0 7fbfefffffffffff
                        fb84eeb80030bac0 fcfffeffffffffff
                        fb848e680030bad0 ffffffefffffffff
                        fb848ef80030bae0 7dbfffffffffffff
                        fb848ec80030be80 cfccf7ffffffffff
    > *idt0::gate_desc
    HANDLER    SEL  DPL  P  TYP  IST
    div0trap   30   0    +  int  0
    > div0trap=K
                    fffffffffb84b7e0

Observe how the address of the zero-division trap handler div0trap is packed inside the 16-byte descriptor at *idt0 address:

    fb848e780030b7e0
    fefffffbffffffff --> fffffffffb84b7e0

The pieces of the address, in order, left to right: sgd_hioffset :16 (bits) is the leading fb84 of the first quadword, sgd_looffset :16 (bits) is its trailing b7e0, and sgd_hi64offset :32 is the trailing ffffffff of the second quadword. [Yes, in this order. Little-endian format BITES]

    > ::print -t gate_desc_t
    {
        unsigned long sgd_looffset :16
        unsigned long sgd_selector :16
        unsigned long sgd_ist :3
        unsigned long sgd_resv1 :5
        unsigned long sgd_type :5
        unsigned long sgd_dpl :2
        unsigned long sgd_p :1
        unsigned long sgd_hioffset :16
        unsigned long sgd_hi64offset :32
        unsigned long sgd_resv2 :8
        unsigned long sgd_zero :5
        unsigned long sgd_resv3 :19
    }

Just in case you want to make sure:

    > *idt0::print gate_desc_t
    {
        sgd_looffset = 0xb7e0
        sgd_selector = 0x30
        sgd_ist = 0
        sgd_resv1 = 0xf
        sgd_type = 0xe
        sgd_dpl = 0
        sgd_p = 0x1
        sgd_hioffset = 0xfb84
        sgd_hi64offset = 0xffffffff
        sgd_resv2 = 0xfb
        sgd_zero = 0x1f
        sgd_resv3 = 0x7f7ff
    }

To make sense of the table entries you will need to read the "Solaris Internals on x86", p. 112--

(**) http://pdos.csail.mit.edu/6.828/2006/readings/i386/MOVS.htm has a nicely formatted HTML version of Intel's 32-bit manual.
Since then, with 64 bit extensions, the logic got a bit more complex :-)

DTrace backend implementation on Solaris x86 http://www.opensolaris.org/os/project/czosug/events_archive/czosug2_dtrace_x86.pdf

====================== IDT ============================

    > $e !grep idt0
    idt0(fffffffffbc2c488): fffffffffb7fd000
    > $e !grep _idt
    rd_idtr(fffffffffb84b740): 90666666c30f010f
    kdi_idt_write(fffffffffb829368): 10ec8348ec8b4855
    kdi_idt(fffffffffbc2cc50): 0
    kdi_idt_switch(fffffffffb824188): 8ec8348ec8b4855
    kdi_idt_sync(fffffffffb823f68): 30bfec8b4855
    kdi_idt_init(fffffffffb823b28): 8ec8348ec8b4855
    kdi_idt_patch(fffffffffb823bd8): 10ec8348ec8b4855
    kdi_idtr_set(fffffffffb845038): 10ec8348ec8b4855
    wr_idtr(fffffffffb84b750): 90666666c31f010f
    > rd_idtr::dis
    rd_idtr:        sidt (%rdi)
    rd_idtr+3:      ret
    > wr_idtr::dis
    wr_idtr:        lidt (%rdi)
    wr_idtr+3:      ret
    > ::idt
         HANDLER     SEL  DPL  P  TYP  IST
      0: div0trap    30   0    +  int  0
      1: dbgtrap     30   0    +  int  0
      2: nmiint      30   0    +  int  0
      3: brktrap     30   3    +  int  0
      4: ovflotrap   30   3    +  int  0
      5: boundstrap  30   0    +  int  0
      6: invoptrap   30   0    +  int  0
    ...

BUT: where are these addresses? Why is

=================== GDT ==================================

    > $e !grep gdt
    wr_gdtr(fffffffffb84b770): eb17010fe5894855
    gdt0_default_r(fffffffffbc4f720): 0
    rmp_gdt_init(fffffffffb82ebd8): 8ec8348ec8b4855
    rd_gdtr(fffffffffb84b760): c907010fe5894855
    gdt0(fffffffffbc2c480): fffffffffb7fe000
    gdt_update_usegd(fffffffffb84a7f8): 10ec8348ec8b4855
    kdi_gdt2gsbase(fffffffffb844f78): 8ec8348ec8b4855
    cpu_get_gdt(fffffffffb850098): 48b4865ec8b4855
    init_boot_gdt(fffffffffb84b578): 8ec8348ec8b4855
    lx_set_gdt(fffffffff7fc92b8): 10ec8348ec8b4855
    lx_clear_gdt(fffffffff7fc92d8): 8ec8348ec8b4855
    > rd_gdtr::dis
    rd_gdtr:        pushq %rbp
    rd_gdtr+1:      movq %rsp,%rbp
    rd_gdtr+4:      sgdt (%rdi)
    rd_gdtr+7:      leave
    rd_gdtr+8:      ret

Some history on system calls in Solaris x86-64: http://blogs.sun.com/tpm/entry/solaris_10_on_x64_processors

============== What happens next?
=================================

We have not explored this far yet, but:

For 32-bit platforms, go to entry_addr_low (should not return)
For 64-bit platforms, patch that address to full 64 bit glory, and then go there.

entry_addr_low is the lower 32-bit part of the target_kernel_text, which is set to platform-specific KERNEL_TEXT_{i386,amd64,i386_xpv} from i86pc/sys/machparam.h . On my system:

    #define KERNEL_TEXT_amd64 UINT64_C(0xfffffffffb800000)

    mdb> 0xfffffffffb800000::dis
    0xfffffffffb800000:     jmp +0x15883 <_start>
    0xfffffffffb800005:     nop
    0xfffffffffb800009:     nop
    0xfffffffffb80000d:     nop
    _locore_start:          leaq +0x40cf49(%rip),%rbp <_edata>
    _locore_start+7:        movq $0x0,0x0(%rbp)
    _locore_start+0xf:      leaq +0x44667a(%rip),%rsp
    _locore_start+0x16:     addq $0x4f10,%rsp
    _locore_start+0x1d:     subq $0x8,%rsp
    _locore_start+0x21:     movq %rdi,+0x42c9f0(%rip)
    _locore_start+0x28:     movq %rdx,+0x403171(%rip)
    ...

The _start label jumped to at KERNEL_TEXT_amd64 is in http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/i86pc/os/fakebop.c, line 1583 (there is some confusion with other _start labels around the text, but this is what my mdb -k shows).

The subsequent path:

    _start ->
    _kobj_boot in kobj_boot.c ->
    kobj_init in krtld/kobj.c -> ??? -> krtld is the Kernel Runtime Linker ->
    _locore_start in locore.s (the comment at that point reads, "XXX Make this less vile, please")

We will look into kernel runtime linking later.