Hierarchies 3: balance, 2-3-4 and Red/Black Trees
We have seen that insertion, search, deletion, finding the max and min in a binary search tree can be done in time proportional to the height of the tree. However, as we also saw, a binary tree can have height O(n) in the worst case. This occurs, for example, when all items are inserted in increasing order. (There are many other such configurations, too). So, how do we insure that this doesn't happen?
Balanced Binary Search Trees
First idea: Each time you insert or delete, "fix up" the tree so that it is always as close to balanced as possible. Problem: This isn't as easy as it looks. Consider this tree:
This is balanced just fine. Now, insert 10. A perfectly balanced tree with these keys would look like this:
Note that not a single parent-child relationship is preserved between the original tree an the new, balanced one. So, it's possible that keeping maintaining balance could take O(n) time, on each update! Well, maybe not every update, but every second one. Consider deleting 70 and inserting 0, then deleting 60 and inserting -10, .... That's no good, since the whole point was to make the operations run fast.
Potential Solutions. We have a binary search tree, we could:
- Give up "binary" -- allow nodes to have varying numbers of children. Leads to 2-3-4 trees, B-trees.
- Give up "perfect" -- keep it "close" to perfectly balanced. Leads to AVL trees, Red-Black trees.
We will first look at 2-3-4 trees, and then relate them to a structure called red-black trees, which are easier to implement.
2-3-4 Trees
The book calls these (2,4) trees. Both names are used.
Intuition:
- Allow multiple keys to be stored at each node.
- A node will have one more child than it has keys:
- leftmost child -- all keys less than first key
- next child -- all keys between first and second keys
- ... etc ...
- last child -- all keys greater than the last key
Rules for 2-3-4 Trees
- Every node has either 2, 3, or 4 children (1, 2, or 3 keys).
- All the leaves of the tree (either external nodes or null pointers) are on the same level.
To insert, you first go as far down the tree as you can. So we build a tree by inserting 41, 38, 31:
In general, go down to a node with external children, and if a 2 or 3 node, expand to a 3 or 4 node, adding the new key in the right location.
If the node that you get down to is a 4-node, SPLIT it into two 2-nodes, promote the middle key to join the parent node (creating a new root if the current root is split), and then continue following down into the correct 2-node. Note that promoting the middle key to the parent may add a fourth key to the parent, which then has to split, etc.
Now let's insert 12. No room in the 4 node, so split:
Now let's insert 19: No problem, goes into the 3 node:
But now if we insert 8, no room in the 4 node, so split, promoting 19:
Now insert 15 (unproblematic), and 17 (one split):
Now insert 33 and 35 (unproblematic):
Now insert 20 -- drives a double split. We want to split 31:33:35, so we can insert 20, but we cannot, because 12:19:38 is already a 4-node (full). So first, we must split that one.
Problems with 2-3-4 trees
This is tricky to implement. You could:
- Have three different types of nodes, and keep creating new ones as you need to (complex to code, lots of allocations)
- Have each node big enough to hold a 4-node (wastes a lot of space)
The big advantage binary nodes have is that they're simple and small; you always need at least TWO children anyway, so you can keep track of what's on either side of a particular key. So how can we have the nice balance of 2-3-4 trees, but with only binary nodes?
Red-Black Trees
Around 1984, Guibas and Sedgewick came up with the following clever idea:
- Translate each 2, 3, or 4-node into a miniature binary tree.
- "Color" each vertex, so that we can tell which nodes belong together as part of a larger 2-3-4 tree node.
So, basically we have encoded each 2, 3, or 4 node using only binary nodes, but we've painted each node with a "color", either red or black. We paint a node red if it's part of a "bigger" node. This is called a red-black tree.
One of our 2-3-4 trees:
becomes a red-black tree:
So, now, one simple way to think about Red-Black trees is to just apply the 2-3-4 tree rules, bearing in mind this encoding.
Red-Black Properties
But, you can also think of a Red-Black tree as a structure separate from a 2-3-4 tree. The rules are:
- Every node is either RED or BLACK; by convention, the empty tree is a leaf, and is always considered BLACK.
- The root is black. If an action colors it red, change it back to black.
- For any node which is RED, both of its children are BLACK (this means: No two consecutive red nodes along any path)
- Every path from any node to a descendant leaf has the same number of black nodes.
Claim: The Red-Black properties (1-4) mean that the depth of the tree is O(lg(n)) if there are n nodes in the tree.
Informal justification:
- Since every path from the root to a leaf has the same number of black nodes (by property 4), the shortest possible path would be one which has NO red nodes in it.
- Suppose k is the number of black nodes along any path from the root to a leaf.
- How many red nodes could there be? At most k. By property 3, anytime you get a red node, your next node must be black. You can only do that k times before you run out of black nodes. So, the LONGEST possible path is at most 2 times the length of the shortest, or h ≤ 2k.
- It can be shown that if each path from the root to a leaf has k
black nodes, there must be AT LEAST 2^k - 1 nodes in the tree. (1
node at root, 2 nodes at level 1, 4 nodes at level 2, etc. Add them
up.) Since
h ≤ 2k, i.e. k ≥ h/2
there must be at least 2^(h/2) - 1 nodes in the tree. If there are n nodes in the tree, that means
n ≥ 2^(h/2) - 1
Adding 1 to both sides:
n + 1 ≥ 2^(h/2)
and taking the log (base 2) of both sides:
lg(n+1) ≥ h/2
2 lg(n + 1) ≥ h, which is O(lg(n)).
Thus the time complexity of the `lookup' operation is O(h), which we just argued is O(lg(n)) in the worst case.
Insertion
As usual for a binary search tree, find the location where the new element goes, and insert it there. Initially, color it RED. This insures that rules 1, 2 and 4 are preserved. But rule 3 might be violated (a red node has black children). There are several cases (assume we're inserting a key x).
- A 2-node (black node with black children)
| | | a ==> a or a (becomes a 3-node) / \ // \ / \\ x x This is not problematic; no violation of rule 3.
- A 4-node (black node with red children)
| a // \\ b c / \ / \ << x to be inserted at one of these locations Here, we have to "split" the node, promoting the middle element to a higher point in the tree. But that's easy: Just join (a) with its parent, and "unjoin" (b) and (c) from (a). This amounts to a "color flip", e.g.:
|| a / \ b c / \ / \\ x << wherever x is, it's okay We haven't changed the black height of anything; the new node is RED, and we just switched the order of red/black in the subtree. So we haven't broken the other properties, and now rule 3 is okay here. But we must make sure the property hasn't been violated at a higher point in the tree! We'll come back to this.
- A 3-node (black root with one red child).
This case is a bit trickier. It depends where inserting:
| | a a // \ or / \\ b <3> <3> b / \ / \ <1> <2> <2> <1> - If inserting at #3, no problem; basically creating
a 4-node, adding x at the front or the back. No color problem:
| | a a // \\ or // \\ << simple case, insert at <3> b x x b / \ / \ / \ / \ - Suppose insert at #1. Then, two red nodes in a straight line:
| | a a // \ or / \\ b <3> <3> b // \ / \\ x <2> <2> x Note in this case that x < b, and b < a. We could fix this operation by "rotating" this whole structure to move b up to the root of the structure, colored black with x on one side (colored red) and a on the other side (colored red):
| | b b // \\ or // \\ x a a x / \ / \ <2> <3> <3> <2> Note that we haven't changed the number of black nodes along any path by doing this, and we've fixed the double red. And sub-trees #2 and #3 are still ordered properly.
This is called a single rotation.
- Suppose insert at #2. Then have two red nodes in a row,
but in a "zig-zag" pattern, e.g.:
| | a a // \ or / \\ b <3> <3> b / \\ // \ <1> x x <1> Now, the "middle" element is x, so we basically want to move x up to the root, and have b and a on either side, both colored red:
| | x x // \\ or // \\ b a a b / \ / \ / \ / \ <1> <3> <3> <1> One way to think of this is to suppose that you do it as two single rotations -- once around b, and then around x. (Here it is shown for the left case only; the other one is symmetric:
| | | a a x // \ // \ // \\ b <3> ==> x <3> ==> b a / \\ // \ / \ / \ <1> x b <1> <3> / \ <1> This is called a double rotation. The resulting 4-node will be the same regardless of which 3-node version you started from.
The book combines both of the tricky cases into what they call a trinode restructuring. This captures the basic idea - you are turning the three nodes of an invalid 4-node into a nice, balanced, valid 4-node. This is the operation that prevents a tree from getting too vine-like. Note that everything under the bottom node moves up a level in the tree.
- If inserting at #3, no problem; basically creating
a 4-node, adding x at the front or the back. No color problem:
Go through all of the 2-3-4 trees and show how the operations act on red-black trees.
Note that even in the worst case, we only have to fix colors along the path from the new node to the root, so that's only a constant factor of additional work. Furthermore, it can be proved that we will never have to do more than one rotation or double-rotation to fix up the tree. All the other changes can be done with color-flips. So, insert is also O(lg(n)).
Deletion
Deleting from both these sorts of trees is a bit more tricky, but it doesn't increase the time complexity -- it can still be done in O(lg(n)) time, worst case. The additional complexity comes because there is one way of splitting nodes, but several ways of merging nodes together (which you sometimes have to do when deleting).
The key to deletion is to make it so that we just have to delete a node at the bottom of the tree. If the node is currently internal, we do basically the same thing that we did for binary search tree deletion: find the immediate predecessor or successor and use its key as a replacement for the one to be deleted. So now we're left with the problem of deleting the predecessor/successor at the bottom of the tree.
In 2-3-4 trees, if we are deleting a 3 or 4 node, we are done. In red-black trees this corresponds to a vertex that is red or whose child is red. Color the child black and we are done.
In the example below we delete 10 by replacing it with its immediate predecessor or successor (we chose predecesssor) just as we would do when deleting in a BST. Replacing 10 by 8 also requires us to delete 8 out of the 3-node that it currently belongs to (easy to do since the node is a 3-node). In the red-black tree, it has a red child that can easily replace it without changing the number of black nodes along that path:
The problem comes when deleting a 2-node. Call this vertex v. Deletion would leave a 1-node, which is not valid. In a red-black tree this creates a node that is double-black to indicate that this path is short one black edge, so this edge counts as two black edges. The cases are:
- If w is an adjacent sibling of v that is a 3 or 4 node, we adopt (steal?) a
child of w and give it to v. To maintain the order we have to move a key
from w up to their common parent and a key from the parent down to v.
In a red-black tree this corresponds to a trinode restructuring
or a slightly more complicated "adjustment". The book has the
details. So to delete 7:
8 12 / \ / \ 7 12:13 -> 8 13 / \ / | \ / \ / \ 8 12 / \ / \ 7 12 -> 8 13 / \ / \\ / \ / \ 13 / \ - If all of v's adjacent siblings are 2-nodes we fuse v with a sibling.
To maintain the order we need a key between them, which we pull down
from the parent. If the parent is a 3 or 4-node this is easy. If
not we have passed the underflow up to the parent node, which then
must be resolved at that level by adopting or fusing. The fusing
operation can be passed up the tree to the root. If the root loses
its only key it disappears and the tree is one level shorter.
In a red-black tree this results in a re-coloring that passes
the double black up to the parent, where it must be resolved.
Removing 3:
12 12 / \ / \ 7 20 -> _ 20 -> 12:20 / \ / \ | / \ / | \ 3 9 15 30 7:9 15 30 7:9 15 30 / \ / \ /\ /\ / | \ /\ /\ / | \ /\ /\ 12 12 12 / \ /// \ / \\ 7 20 -> 7 20 -> 7 20 / \ / \ / \\ / \ / \\ / \ 3 9 15 30 9 15 30 9 15 30 / \ / \ /\ /\ / \ /\ /\ / \ /\ /\
The ideas are straightforward, but the implementation can be a bit messy. The book covers it in detail.
Synopsis
Binary search trees are a good data structure for implementing the Map ADT, but their performance is best when they are kept in balance.
- Keeping "perfect" balance turns out to be too much work.
- But we can make it work if we either give up "binary" (have search trees with variable numbers of children), or give up "perfect" (allow trees to be "somewhat unbalanced")
- 2-3-4 trees: Use the first strategy.
- All leaves are at the same level, all paths the same length.
- But it is clumsy to implement due to variable-size nodes.
- Red-Black trees: Use the second strategy.
- Encode 2-3-4 nodes as "mini-trees", nodes colored red to indicate they are conjoined with their parent.
- Use "rotations" and "color flips" to keep the tree in balance, takes no more than O(lg(n)) time. Also, only do one restructuring adjustment for a insertion or deletion. This is important for some algorithms that you will see if you take CS 31.
- Doing this, we get guaranteed O(lg(n)) runtime for all the Map operations -- insert, lookup, delete.
The current Java documentation indicates that it uses red-black
trees for TreeMap
.
B-Trees
2-3-4 trees are a special case of multiway trees. A multiway tree of particular interest is the B-tree. A B-tree of order d has nodes with between d/2 (rounded up) and d children. So a 2-3-4 tree is a B-tree of order 4. The B-trees of even order are particularly nice, because insertions and deletions are handled in very similar ways to 2-3-4 trees, with splits and fusions.
The place where a B-tree is particularly useful is in storing a TreeMap of size n that is so large that it must be stored on disk. Searching through a regular binary tree stored on disk would require O(log n) comparisons, but it is quite probable that most of the time getting a child's key would require reading a new disk block. Disk reads are many orders of magnitude slower than memory accesses, and O(log n) disk reads could be time-consuming. What is especially galling is that a disk read brings in a entire disk block, usually between 512 bytes and 2048 bytes, and we use only one key of maybe 4 bytes out of that block.
An alternative is to store the TreeMap as a B-tree, whose order is the number of key/value pairs that will fit into a disk block. If the value is some sort of a refererence to where the data is stored on disk, as is usually the case, instead of getting a single key per disk read we get hundreds of them. The depth of the B-tree of order d is still O(log n), but the base of the logarithm is d instead of 2. This is a case where constants matter. Even maps with terrabytes of keys can be stored in a few levels. (Note that if d is 100 and there are 1012 keys that the number of levels of the tree will be 6, as opposed to over 40 in a regular binary tree.)