==================== Kernel's Memory Space ====================

The starting point for creating the kernel memory image:

http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/i86pc/os/startup.c#670

Also look at the constant definitions that set the platform limits of
linear addresses, starting at line 189. These are used to compute
everything else in the address layout (see line 383 for an example layout).

Note also lines 289--363 for the important kernel symbols: these
variables are defined here and are referred to throughout other code
through 'extern' declarations.

(Notice that this is x86 platform-specific startup code; startup has to be
platform-specific until higher-level abstractions like Vmem can be used.
Note that such abstractions still have to work around the so-called
"memory hole", a range of 64-bit addresses that cannot be used on most
systems. You will find checks for the memory hole even in high-level
objects like AS: e.g., seg_alloc, the function that builds the segments
of an address space,
  http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/vm/vm_seg.c#seg_alloc
calls valid_va_range:
  http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/i86pc/vm/vm_machdep.c#valid_va_range
)

The kernel is loaded into a set of fixed, platform-specific virtual
address ranges. To reiterate, all addresses are virtual, including any
address that is encoded as part of an instruction.

Kernel (virtual) address layouts:

http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/i86pc/os/startup.c#380

Within these kernel address ranges, address mappings are created and
managed by several kernel "segment drivers":

Kernel's segments (Ch. 11.1):

  seg_kmem -- normal non-pageable kernel memory management
  seg_kp   -- kernel pageable memory management
  seg_map  -- file cache pages mapped into kernel space (::addr2smap DCMD)
  seg_kpm  -- all physical memory pages mapped into kernel space (only in x64)

Read more about these in 11.1.5-6.

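To make the term "segment driver" concrete before looking at the real
structures: each segment carries a pointer to a table of functions (its
ops vector) plus a pointer to private data, and generic VM code calls
through that table. The sketch below is a user-space toy under stated
assumptions. The names demo_seg, demo_seg_ops, wired_ops and pageable_ops
are hypothetical, and the real ops vector has many more entries than the
single "fault" hook shown here.

```c
#include <assert.h>
#include <stddef.h>

struct demo_seg;

/* Hypothetical, heavily reduced ops vector: one entry point only. */
struct demo_seg_ops {
    const char *name;
    int (*fault)(struct demo_seg *seg, size_t off); /* 0 = handled, -1 = error */
};

/* Hypothetical segment: an address range, an ops vector, private data. */
struct demo_seg {
    size_t s_base;
    size_t s_size;
    const struct demo_seg_ops *s_ops;
    void *s_data;
};

/* Toy rule: wired memory is always resident, so any fault is an error. */
static int wired_fault(struct demo_seg *seg, size_t off)
{
    (void)seg;
    (void)off;
    return -1;
}

/* Toy rule: a pageable segment "pages in" and counts it in s_data. */
static int pageable_fault(struct demo_seg *seg, size_t off)
{
    int *pageins = seg->s_data;

    (void)off;
    (*pageins)++;
    return 0;
}

static const struct demo_seg_ops wired_ops = { "toy_wired", wired_fault };
static const struct demo_seg_ops pageable_ops = { "toy_pageable", pageable_fault };

/* Generic code: range check, then dispatch through the segment's ops. */
static int demo_seg_fault(struct demo_seg *seg, size_t addr)
{
    assert(seg != NULL && seg->s_ops != NULL);
    if (addr < seg->s_base || addr - seg->s_base >= seg->s_size)
        return -1;
    return seg->s_ops->fault(seg, addr - seg->s_base);
}
```

The caller of demo_seg_fault never needs to know which driver owns the
segment; swapping s_ops (and the matching s_data) changes the behaviour.
When reading the real kernel segments, look for the same pairing of an
ops table with a private data pointer.
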
The Smap layer establishes a mapping from a virtual address to a
"page identity":

http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/vm/seg_map.h#75

In MDB, it is reported by the ::addr2smap command.

An important observation about "large pages" is found on p. 530, 11.1.2,
regarding the effect of large pages on the kernel's efficiency.
A 10% improvement is a lot.

Suggestion: examine 'kas' in kernel space and draw the full tree of
kernel segments. Observe their different *_ops arrays and compare them
with the segment drivers above.

=========== AVL trees, embedding & offsets ===========

Address spaces ('struct as') of processes ('proc_t') -- as well as the
kernel's address space 'kas' -- are made up of segments ('struct seg').
These segments are arranged in an AVL tree (a kind of balanced binary
search tree) to make finding the segment for a faulting address
efficient; see the definitions of 'avl_tree_t' and 'avl_node_t', and of
their containing 'struct as' and 'struct seg' respectively.

Note that the tree and node structures are embedded in, rather than
pointed to from, the respective OS abstractions -- note the offset
manipulation used to translate between embedded nodes and their
containing data structures.

http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/vm/vm_as.c#as_findseg
  (note the caching of the last looked-up segment)
hands off to
http://src.illumos.org/source/xref/illumos-gate/usr/src/common/avl/avl.c#avl_find
  (essentially, a simple binary search)

For every kernel-space data structure, the question "how is access to
it synchronized?" is essential (one cannot write any kernel code using a
data structure without understanding its synchronization model).

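Returning to the embedding and offset translation described above, here
is a minimal user-space sketch of the idea. It is not the avl.c code:
toy_node, toy_seg, toy_tree and the helper functions are hypothetical
names, the tree is a plain unbalanced binary search tree (no AVL
rebalancing), and the comparator is only an assumption about how a lookup
by address against [base, base + size) ranges might be written.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an embedded tree node. */
struct toy_node {
    struct toy_node *child[2];
};

/* Hypothetical container: the node lives inside the segment itself. */
struct toy_seg {
    uintptr_t base;
    size_t size;
    struct toy_node link;   /* embedded, not pointed to */
};

struct toy_tree {
    struct toy_node *root;
    size_t offset;          /* offsetof(container, link); same for all nodes */
    int (*compar)(const void *, const void *);
};

/* Node -> containing object: step back by the tree-wide offset. */
static void *node2data(const struct toy_tree *t, struct toy_node *n)
{
    return (char *)n - t->offset;
}

/* Containing object -> node: step forward by the same offset. */
static struct toy_node *data2node(const struct toy_tree *t, void *d)
{
    return (struct toy_node *)((char *)d + t->offset);
}

/* Key is "a segment whose base is the address searched for". */
static int seg_compar(const void *a, const void *b)
{
    const struct toy_seg *key = a, *seg = b;

    if (key->base < seg->base)
        return -1;
    if (key->base >= seg->base + seg->size)
        return 1;
    return 0;
}

/* Plain BST insert (no rebalancing); segments are assumed disjoint. */
static void toy_insert(struct toy_tree *t, void *data)
{
    struct toy_node **pp = &t->root;
    struct toy_node *n = data2node(t, data);

    n->child[0] = n->child[1] = NULL;
    while (*pp != NULL) {
        int c = t->compar(data, node2data(t, *pp));
        pp = &(*pp)->child[c > 0];
    }
    *pp = n;
}

/* Binary search; returns the containing object, never the raw node. */
static void *toy_find(const struct toy_tree *t, const void *key)
{
    struct toy_node *n = t->root;

    while (n != NULL) {
        void *data = node2data(t, n);
        int c = t->compar(key, data);

        if (c == 0)
            return data;
        n = n->child[c > 0];
    }
    return NULL;
}
```

The tree code only ever sees toy_node pointers and one offset, so the
same routines would serve any structure that embeds a toy_node. The
price is the assumption that every object in a given tree embeds its
node at the same offset and is ordered by the same comparator.
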
The comment at line 26 of avl.c explains the synchronization model for
AVL trees:

http://src.illumos.org/source/xref/illumos-gate/usr/src/common/avl/avl.c#26

---- Kernel AS (kas) example ----

> kas::print
{
    a_contents = {
        _opaque = [ 0 ]
    }
    a_flags = 0
    a_vbits = 0
    a_cv = {
        _opaque = 0
    }
    a_hat = 0xffffff0149476e78
    a_hrm = 0
    a_userlimit = 0
    a_seglast = kmapseg
    a_lock = {
        _opaque = [ 0 ]
    }
    a_size = 0xff3814b000
    a_lastgap = 0
    a_lastgaphl = 0
    a_segtree = {
        avl_root = kvseg+0x20
        avl_compar = as_segcompar
        avl_offset = 0x20
        avl_numnodes = 0x9
        avl_size = 0x60
    }
    ...

Observe that there are 9 segments in the kernel address space
(avl_numnodes = 0x9). At the root of the kernel AS tree is the kvseg
seg_t structure.

> kvseg::print
{
    s_base = 0xffffff0149400000
    s_size = 0xfe76c00000
    s_szc = 0
    s_flags = 0
    s_as = kas
    s_tree = {
        avl_child = [ kpseg+0x20, ktextseg+0x20 ]
        avl_pcb = 0
    }
    s_ops = segkmem_ops
    s_data = kvps
    ...

Abbreviated:

> kvseg::seg
             SEG             BASE             SIZE             DATA OPS
fffffffffbc31530 ffffff0149400000       fe76c00000 fffffffffbceea30 segkmem_ops

From the child nodes kpseg and ktextseg you can explore the full AVL
tree of the kernel address space. [Do it! Note the different *_ops and
s_data structs for segments -- their interplay makes up the "segment
drivers" described in the textbook.]

Observe the switching between the avl_node_t embedded in the respective
seg_t structs (at 0x20, the a_segtree's avl_offset) and the actual seg_t
objects. In avl.c this translation is done with the macros AVL_NODE2DATA
and AVL_DATA2NODE. The implicit assumption is that in any avl_tree_t the
offset involved is the same for all tree nodes; the same holds for the
comparator function applied to tree nodes in avl.c.

============= Legacy BSD kernel memory allocator =============

Before we start on OpenSolaris' Vmem allocator, it will be instructive
to look at the legacy BSD and the comparatively recent Linux kernel
memory allocator interfaces.

The legacy BSD "malloc" code was made famous by the SCO case (in which
SCO laid a claim to "intellectual property" in the Linux kernel):

The story: http://www.lemis.com/grog/SCO/code-comparison.html

The code:

/*
 * Allocate 'size' units from the given
 * map. Return the base of the allocated space.
 * In a map, the addresses are increasing and the
 * list is terminated by a 0 size.
 * The core map unit is 64 bytes; the swap map unit
 * is 512 bytes.
 * Algorithm is first-fit.
 */
malloc(mp, size)
struct map *mp;
{
	register unsigned int a;
	register struct map *bp;

	for(bp=mp;bp->m_size && ((bp-mp) < MAPSIZ);bp++) {
		if (bp->m_size >= size) {
			a = bp->m_addr;
			bp->m_addr += size;
			if ((bp->m_size -= size) == 0) {
				do {
					bp++;
					(bp-1)->m_addr = bp->m_addr;
				} while ((bp-1)->m_size = bp->m_size);
			}
			return(a);
		}
	}
	return(0);
}

The sequence of "struct map"s traversed by incrementing bp acts as a
free list, in which the first chunk whose size is greater than or equal
to the requested size is found. When a chunk is used up completely, the
do-while loop closes the gap by shifting the following entries (up to
and including the terminating 0-size entry) down by one.

Note that the "struct map" pointed to by bp can be allocated either
"in-band" or in a separate memory area. OpenSolaris chooses to allocate
similar structures out-of-band, as explained in 11.3.4.1.

============= Linux generic kernel memory allocator API =============

Linux malloc with in-band boundary tags is explained in
http://www.dent.med.uni-muenchen.de/~wmglo/malloc-slides.html

The 2.6 Linux kernel memory allocator API is described here:
http://www.linuxjournal.com/article/6930

Note that kmalloc() is a function shared by all of the kernel's non-slab
allocations (slabs are handled differently and are closer to the
OpenSolaris KMEM allocator (Ch. 11.2), without the extra "magazine" and
"depot" layers). Flags from Table 4 in the above link determine whether
a particular allocation may block, and also distinguish between several
purposes of the allocated memory.

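Looking back at the legacy BSD routine quoted above: to experiment with
its first-fit behaviour outside a kernel, it can be restated in modern C.
The sketch below is a port under assumptions, not the historical source.
The name map_alloc, the field types of struct map, and the value of
MAPSIZ are all chosen for illustration. As in the original, a return
value of 0 means failure, so unit address 0 can never be handed out.

```c
#include <assert.h>
#include <stddef.h>

#define MAPSIZ 8    /* hypothetical table size for this sketch */

/* Assumed layout: one free run per entry, list ends at m_size == 0. */
struct map {
    unsigned int m_size;    /* length of the free run, in units */
    unsigned int m_addr;    /* base of the free run, in units */
};

/*
 * First-fit allocation of 'size' units from the map 'mp'.
 * Returns the base of the allocated run, or 0 if nothing fits.
 * The table must contain a terminating 0-size entry.
 */
static unsigned int map_alloc(struct map *mp, unsigned int size)
{
    assert(mp != NULL && size > 0);

    for (struct map *bp = mp; bp->m_size != 0 && (bp - mp) < MAPSIZ; bp++) {
        if (bp->m_size < size)
            continue;

        unsigned int a = bp->m_addr;

        /* Carve the request off the front of this free run. */
        bp->m_addr += size;
        bp->m_size -= size;

        if (bp->m_size == 0) {
            /* Run used up: shift later entries down, terminator included. */
            do {
                bp++;
                (bp - 1)->m_addr = bp->m_addr;
                (bp - 1)->m_size = bp->m_size;
            } while ((bp - 1)->m_size != 0);
        }
        return a;
    }
    return 0;
}
```

With a table holding free runs {10 units at 100} and {4 units at 200},
a request for 4 units returns 100 and leaves {6 at 104}; a following
request for 6 units returns 104, empties the first entry, and compacts
the table so that {4 at 200} becomes the first entry.
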
============= OpenSolaris kernel's VMEM allocators =============

By contrast, OpenSolaris interfaces allow multiple named pools of memory
with uniform properties per pool ("Slabs" aka "Kmem caches"; Vmem
"arenas"). Essentially, a pool becomes a named object, for which the
allocation and deallocation functions become methods. Pools can be
nested and configured to obtain new allocations from an enclosing pool
object when necessary.

The textbook stresses the generalized character of the VMEM allocator in
Ch. 11.3, pp. 552--553. As described, VMEM allocates subranges of
integers of a requested size within the initial range allocated at
system boot. The integers are primarily meant to be address ranges (in
particular, nested ones), but can also be integer ID ranges. This is
stressed by calling the allocated ranges "resources", not "addresses".

Although the allocator includes some special functions that are
address-aware (vmem_xalloc, in particular, controls address range
"coloring" as in 10.2.7), they try to be as forgetful about the nature
of the ranges as possible, and treat allocation as a general algorithmic
problem of allocating integer intervals economically.

The initial range is ultimately derived either from the static
per-platform kernel memory layout, as in

http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/i86pc/os/startup.c#383

or from a fixed permissible range of IDs.

Page 554 summarizes the VMEM interface, which is explained on
pp. 555--560. Read it before we start looking at the actual Vmem code.
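
As a warm-up for that reading, the two ideas above (allocating integer
intervals, and a pool that imports from an enclosing pool) fit in a few
lines of C. This is a toy under stated assumptions and not the Vmem
interface or its algorithm. toy_arena, toy_arena_add and toy_arena_alloc
are hypothetical names; the allocator is plain first-fit over a small
fixed array of free spans, with no freeing, coalescing, alignment or
locking; and 0 is used as the failure value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ARENA_MAXSPANS 16   /* hypothetical limit on free spans per arena */

struct toy_span {
    uintptr_t base;
    size_t size;
};

/* A named pool of integer ranges, optionally backed by a source arena. */
struct toy_arena {
    const char *name;
    struct toy_arena *source;   /* enclosing arena to import from, or NULL */
    size_t import_size;         /* minimum amount to import at a time */
    struct toy_span free[ARENA_MAXSPANS];
    size_t nfree;
};

/* Give the arena a free span [base, base + size). Returns 0 or -1. */
static int toy_arena_add(struct toy_arena *a, uintptr_t base, size_t size)
{
    assert(a != NULL && size > 0);
    if (a->nfree == ARENA_MAXSPANS)
        return -1;
    a->free[a->nfree].base = base;
    a->free[a->nfree].size = size;
    a->nfree++;
    return 0;
}

/* First-fit allocation of 'size' integers; returns the base or 0. */
static uintptr_t toy_arena_alloc(struct toy_arena *a, size_t size)
{
    assert(a != NULL && size > 0);

    for (size_t i = 0; i < a->nfree; i++) {
        struct toy_span *s = &a->free[i];

        if (s->size >= size) {
            uintptr_t base = s->base;

            s->base += size;
            s->size -= size;
            if (s->size == 0)
                *s = a->free[--a->nfree];   /* drop the emptied span */
            return base;
        }
    }

    /* Nothing fits locally: try to import a span from the source arena. */
    if (a->source != NULL) {
        size_t want = size > a->import_size ? size : a->import_size;
        uintptr_t base = toy_arena_alloc(a->source, want);

        /* Toy shortcut: if the span table is full, the import is lost. */
        if (base != 0 && toy_arena_add(a, base, want) == 0)
            return toy_arena_alloc(a, size);
    }
    return 0;
}
```

Nothing in the code cares whether the integers are addresses or IDs:
a top-level arena seeded with one span could just as well hold a range
of process IDs. The child arena only ever sees what it has imported,
which is the nesting described above in miniature.
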