Hierarchies 3: balance, 2-3-4 and Red/Black Trees


We have seen that insertion, search, deletion, and finding the max and min in a binary search tree can all be done in time proportional to the height of the tree. However, as we also saw, a binary tree can have height O(n) in the worst case. This occurs, for example, when all items are inserted in increasing order. (There are many other such configurations, too.) So, how do we ensure that this doesn't happen?

Slides from class

Balanced Binary Search Trees

First idea: Each time you insert or delete, "fix up" the tree so that it is always as close to balanced as possible. Problem: This isn't as easy as it looks. Consider this tree:

        50
       /  \
     30    70
    /  \   /
  20   40 60

This is balanced just fine. Now, insert 10. A perfectly balanced tree with these keys would look like this:

        40
       /  \
     20    60
    /  \   /  \
  10   30 50   70

Note that not a single parent-child relationship is preserved between the original tree and the new, balanced one. So, it's possible that maintaining perfect balance could take O(n) time on each update! Well, maybe not every update, but every second one. Consider deleting 70 and inserting 0, then deleting 60 and inserting -10, .... That's no good, since the whole point was to make the operations run fast.

Potential Solutions. Starting from a binary search tree, we could:

  1. Give up "binary" -- allow nodes to have varying numbers of children. Leads to 2-3-4 trees, B-trees.
  2. Give up "perfect" -- keep it "close" to perfectly balanced. Leads to AVL trees, Red-Black trees.

We will first look at 2-3-4 trees, and then relate them to a structure called red-black trees, which are easier to implement.

2-3-4 Trees

The book calls these (2,4) trees. Both names are used.

Intuition:

    a        a:b        a:b:c
   / \      / | \      / / \ \

Rules for 2-3-4 Trees

  1. Every node has either 2, 3, or 4 children (1, 2, or 3 keys).
  2. All the leaves of the tree (either external nodes or null pointers) are on the same level.
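The two rules above can be captured in a few lines. The following is a minimal, hypothetical sketch (the class and field names are made up for this illustration, not from the book): a node holds 1-3 sorted keys and, if internal, one more child than keys.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a 2-3-4 node: 1-3 sorted keys, and (for internal
// nodes) exactly one more child than keys. Leaves have no children.
class Node234 {
    final List<Integer> keys = new ArrayList<>();      // 1 to 3 keys
    final List<Node234> children = new ArrayList<>();  // 0 (leaf) or keys+1

    boolean isLeaf() { return children.isEmpty(); }
    boolean isFull() { return keys.size() == 3; }      // a 4-node
    boolean isValid() {
        return keys.size() >= 1 && keys.size() <= 3
            && (isLeaf() || children.size() == keys.size() + 1);
    }
}

public class Rule234Demo {
    public static void main(String[] args) {
        Node234 n = new Node234();
        n.keys.add(31); n.keys.add(38); n.keys.add(41);  // the 4-node 31:38:41
        System.out.println(n.isFull());   // a 4-node is full
        System.out.println(n.isValid());  // a leaf with 3 keys is legal
    }
}
```

Rule 2 (all leaves on the same level) is a property of the whole tree rather than of one node, so it is maintained by the insertion algorithm below, not checked here.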

To insert, you first go as far down the tree as you can. So we build a tree by inserting 41, 38, 31:

  41   ->  38:41  ->  31:38:41
 /  \      / | \      / / \ \

In general, go down to a node with external children; if it is a 2- or 3-node, expand it to a 3- or 4-node, adding the new key in the right location.

If the node that you get down to is a 4-node, SPLIT it into two 2-nodes, promote the middle key to join the parent node (creating a new root if the current root is split), and then continue following down into the correct 2-node. Note that promoting the middle key to the parent may add a fourth key to the parent, which then has to split, etc.
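The split step on its own can be sketched in isolation. Below, nodes are represented only by their key lists and the child-pointer bookkeeping is omitted; the method name and representation are assumptions for this sketch, not the book's implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hedged sketch of splitting a 4-node during top-down insertion: the
// middle of its three keys is promoted into the parent's key list (kept
// sorted), and the 4-node is replaced by two 2-nodes.
public class Split234 {
    static List<List<Integer>> split(List<Integer> node, List<Integer> parentKeys) {
        int middle = node.get(1);
        // insert the promoted key into the parent at its sorted position
        int pos = 0;
        while (pos < parentKeys.size() && parentKeys.get(pos) < middle) pos++;
        parentKeys.add(pos, middle);
        // the 4-node becomes two 2-nodes
        List<Integer> left  = new ArrayList<>(List.of(node.get(0)));
        List<Integer> right = new ArrayList<>(List.of(node.get(2)));
        return Arrays.asList(left, right);
    }

    public static void main(String[] args) {
        // Split the 4-node 7:8:9 under a parent 2-node holding 12:
        List<Integer> parent = new ArrayList<>(List.of(12));
        List<List<Integer>> halves = split(new ArrayList<>(List.of(7, 8, 9)), parent);
        System.out.println(parent);  // [8, 12]  -- 8 was promoted
        System.out.println(halves);  // [[7], [9]]
    }
}
```

Note that promoting into the parent can make the parent a 4-node; the real algorithm avoids this by splitting any full node on the way down, which is exactly the cascade shown in the examples that follow.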

Now let's insert 12. No room in the 4 node, so split:

       38                              38
      /  \                            /  \
    31    41    now insert 12 -->  12:31   41
   /  \  /  \                      / | \   / \

Now let's insert 19: No problem, goes into the 3 node:

          38
         /  \
  12:19:31    41
  / /  \ \    / \

But now if we insert 8, no room in the 4 node, so split, promoting 19:

    19:38                           19:38
   /  |  \                         /  |  \
  12  31  41   now insert 8 --> 8:12  31  41
  /\  /\  /\                    /|\   /\  /\

Now insert 15 (unproblematic), and 17 (one split):

     19:38                           19:38
    /  |  \                         /  |  \
 8:12  31  41   -- 15 -->    8:12:15   31   41   -- 17 -->
 / | \ /\  /\                / / \ \   /\   /\

        12:19:38
       /  |  |  \
      8 15:17  31  41
     /\  /|\   /\  /\

Now insert 33 and 35 (unproblematic):

     12:19:38                       12:19:38
    /  |  |  \                     /  /  \  \
   8 15:17  31  41                8 15:17  31:33  41
  /\  /|\   /\  /\               /\  /|\   / | \  /\

        12:19:38
       /  /  \  \
      8 15:17  31:33:35  41
     /\  /|\   / / \ \   /\

Now insert 20 -- drives a double split. We want to split 31:33:35, so we can insert 20, but we cannot, because 12:19:38 is already a 4-node (full). So first, we must split that one.

            19
     /------+------\
   12                38              << after first split (root)
  /  \             /    \
 8   15:17    31:33:35   41
 /\   /|\     / / \ \    /\

            19
     /------+------\
   12              33:38             << after second split
  /  \            /  |  \
 8   15:17      31   35   41
 /\   /|\       /\   /\   /\

            19
     /------+------\
   12              33:38
  /  \            /  |  \
 8   15:17    20:31   35   41        << 20 finally inserted
 /\   /|\     / | \   /\   /\

Problems with 2-3-4 trees

This is tricky to implement. You could use three different node classes, converting between them as nodes grow and shrink, or a single node class with room for three keys and four children, which wastes space in the common 2-nodes. Either way there are many cases to handle.

The big advantage binary nodes have is that they're simple and small; you always need at least TWO children anyway, so you can keep track of what's on either side of a particular key. So how can we have the nice balance of 2-3-4 trees, but with only binary nodes?

Red-Black Trees

Around 1984, Guibas and Sedgewick came up with the following clever idea:

 2-nodes:    a    ->    a
            / \        / \

 3-nodes:   a:b   ->     a     OR     b        (Here a double // or \\ means
           / | \        / \\        // \        that the child is RED)
                           b       a
                          / \     / \

 4-nodes:  a:b:c  ->      b
          / / \ \       // \\
                       a     c
                      / \   / \

So, basically we have encoded each 2, 3, or 4 node using only binary nodes, but we've painted each node with a "color", either red or black. We paint a node red if it's part of a "bigger" node. This is called a red-black tree.
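The encoding needs only one extra bit per node. A minimal sketch (class and field names are assumptions for illustration): a 3-node a:b can be encoded with either key on top, and both encodings hold the same keys in the same sorted order.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative encoding of a 3-node using binary nodes plus a color bit.
// A red node is "glued" to its black parent: together they act as one
// bigger 2-3-4 node.
class RBNode {
    int key;
    boolean red;          // red = part of a bigger 2-3-4 node
    RBNode left, right;
    RBNode(int key, boolean red) { this.key = key; this.red = red; }
}

public class EncodeDemo {
    static void inorder(RBNode n, List<Integer> out) {
        if (n == null) return;
        inorder(n.left, out);
        out.add(n.key);
        inorder(n.right, out);
    }

    public static void main(String[] args) {
        // 3-node 7:8 encoded with 8 on top (red left child 7) ...
        RBNode top8 = new RBNode(8, false);
        top8.left = new RBNode(7, true);
        // ... or with 7 on top (red right child 8); both are legal.
        RBNode top7 = new RBNode(7, false);
        top7.right = new RBNode(8, true);

        List<Integer> a = new ArrayList<>(), b = new ArrayList<>();
        inorder(top8, a);
        inorder(top7, b);
        System.out.println(a.equals(b));  // same key order either way
    }
}
```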

One of our 2-3-4 trees:

      12:19:38
     /  /  \  \
    8 15:17 31:33 41
   /\  /|\  /|\   /\

becomes a red-black tree:

           19
         //  \\
       12      38
      /  \    /  \
     8   17  31   41
    /\  // \ / \\ /\
       15       33
       /\       /\

So, now, one simple way to think about Red-Black trees is to just apply the 2-3-4 tree rules, bearing in mind this encoding.

Red-Black Properties

But, you can also think of a Red-Black tree as a structure separate from a 2-3-4 tree. The rules are:

  1. Every node is either RED or BLACK; by convention, the empty tree is a leaf, and is always considered BLACK.
  2. The root is black. If an action colors it red, change it back to black.
  3. For any node which is RED, both of its children are BLACK (this means: No two consecutive red nodes along any path)
  4. Every path from any node to a descendant leaf has the same number of black nodes.
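Rules 1-4 can be checked mechanically on any candidate tree. The sketch below is illustrative (the helper names are made up, not from the book): `blackHeight` returns the uniform count of black nodes on every path to a null leaf, or -1 on any violation.

```java
// Hedged sketch: validating the red-black properties on a linked binary
// tree whose nodes carry a color bit.
class RB {
    int key;
    boolean red;
    RB left, right;
    RB(int k, boolean red) { key = k; this.red = red; }
}

public class RBCheck {
    // Returns the uniform black height of the subtree, or -1 on violation.
    static int blackHeight(RB n) {
        if (n == null) return 1;                       // null leaves are BLACK
        if (n.red && ((n.left != null && n.left.red)
                   || (n.right != null && n.right.red)))
            return -1;                                 // rule 3: no red-red edge
        int l = blackHeight(n.left), r = blackHeight(n.right);
        if (l == -1 || r == -1 || l != r) return -1;   // rule 4: equal black counts
        return l + (n.red ? 0 : 1);
    }

    static boolean isRedBlack(RB root) {
        return (root == null || !root.red)             // rule 2: black root
            && blackHeight(root) != -1;
    }

    public static void main(String[] args) {
        // Top of the example tree: black 19 with red children 12 and 38.
        RB root = new RB(19, false);
        root.left = new RB(12, true);
        root.right = new RB(38, true);
        System.out.println(isRedBlack(root));  // true: all four rules hold
        root.left.left = new RB(8, true);      // red child of a red node
        System.out.println(isRedBlack(root));  // false: rule 3 violated
    }
}
```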

Claim: The Red-Black properties (1-4) mean that the depth of the tree is O(lg(n)) if there are n nodes in the tree.

Informal justification: by rule 4, every root-to-leaf path contains the same number of black nodes, so the black nodes alone form a perfectly balanced structure of height O(lg(n)). By rule 3, red nodes can at most double the length of any path. So the longest root-to-leaf path is at most twice the shortest, and the height h is O(lg(n)).

Thus the time complexity of the `lookup' operation is O(h), which we just argued is O(lg(n)) in the worst case.

Insertion

As usual for a binary search tree, find the location where the new element goes, and insert it there. Initially, color it RED. This ensures that rules 1, 2 and 4 are preserved. But rule 3 might be violated (the new red node may have a red parent). There are several cases (assume we're inserting a key x).

Go through all of the 2-3-4 trees and show how the operations act on red-black trees.

Note that even in the worst case, we only have to fix colors along the path from the new node to the root, so that's only a constant factor of additional work. Furthermore, it can be proved that we will never have to do more than one rotation or double-rotation to fix up the tree. All the other changes can be done with color-flips. So, insert is also O(lg(n)).

Deletion

Deleting from both these sorts of trees is a bit more tricky, but it doesn't increase the time complexity -- it can still be done in O(lg(n)) time, worst case. The additional complexity comes because there is one way of splitting nodes, but several ways of merging nodes together (which you sometimes have to do when deleting).

The key to deletion is to make it so that we just have to delete a node at the bottom of the tree. If the node is currently internal, we do basically the same thing that we did for binary search tree deletion: find the immediate predecessor or successor and use its key as a replacement for the one to be deleted. So now we're left with the problem of deleting the predecessor/successor at the bottom of the tree.
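The predecessor-replacement step is the same as in any BST. A minimal sketch, assuming ordinary linked nodes (the class name is made up for this illustration):

```java
// Sketch of the first deletion step, common to BSTs, 2-3-4 trees, and
// red-black trees: an internal key is replaced by its in-order predecessor
// (the maximum of the left subtree), reducing the problem to deleting a
// key at the bottom of the tree.
class BST {
    int key;
    BST left, right;
    BST(int k) { key = k; }
}

public class PredecessorDemo {
    static int predecessor(BST node) {
        BST cur = node.left;               // predecessor = max of left subtree
        while (cur.right != null) cur = cur.right;
        return cur.key;
    }

    public static void main(String[] args) {
        // 10 with 7 (and its right child 8) on the left, 12 on the right.
        BST root = new BST(10);
        root.left = new BST(7);
        root.left.right = new BST(8);
        root.right = new BST(12);
        System.out.println(predecessor(root));  // 8 replaces 10
    }
}
```

Using the successor (minimum of the right subtree) instead is symmetric.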

In 2-3-4 trees, if the key being deleted is in a 3- or 4-node at the bottom, we are done: just remove it. In red-black trees this corresponds to a bottom vertex that is red or whose child is red; remove it, color the red child (if any) black, and we are done.

In the example below we delete 10 by replacing it with its immediate predecessor or successor (we chose predecessor) just as we would do when deleting in a BST. Replacing 10 by 8 also requires us to delete 8 out of the 3-node that it currently belongs to (easy to do since the node is a 3-node). In the red-black tree, it has a red child that can easily replace it without changing the number of black nodes along that path:

       10                       8
      /  \                     /  \
    7:8   12:13      ->       7   12:13
    /|\   / | \              / \  / | \

       10                       8
      /  \                     / \
     8     12        ->       7   12
    // \   / \\              / \  / \\
    7       13                     13
   / \     / \                    / \

The problem comes when deleting from a 2-node. Call this vertex v. Deletion would leave a 1-node (no keys, one child), which is not valid. In a red-black tree this creates a node that is double-black, indicating that paths through it are one black node short: the double-black node counts as two black nodes. The cases are:

  1. If w is an adjacent sibling of v that is a 3- or 4-node, we adopt (steal?) a child of w and give it to v. To maintain the order we have to move a key from w up to their common parent and a key from the parent down to v. In a red-black tree this corresponds to a trinode restructuring or a slightly more complicated "adjustment". The book has the details. So to delete 7:

         8                     12
        / \                   /  \
       7   12:13    ->       8    13
      / \  / | \            / \   / \

         8                     12
        / \                   /  \
       7   12       ->       8    13
      / \  / \\             / \   / \
           13
          / \
  2. If all of v's adjacent siblings are 2-nodes, we fuse v with a sibling. To maintain the order we need a key between them, which we pull down from the parent. If the parent is a 3- or 4-node this is easy. If not, we have passed the underflow up to the parent node, which must then be resolved at that level by adopting or fusing. The fusing operation can be passed up the tree to the root. If the root loses its only key it disappears and the tree is one level shorter. In a red-black tree this results in a re-coloring that passes the double black up to the parent, where it must be resolved. Removing 3:

          12                 12
         /  \               /  \               12:20
        7    20     ->     _    20     ->     /  |  \
       / \   / \           |   /  \        7:9   15  30
      3   9 15  30        7:9 15  30      / | \  /\  /\
     /\  /\ /\  /\       /|\  /\  /\

          12                 12                 12
         /  \              ///  \              /  \\
        7    20     ->    7      20    ->     7     20
       / \   / \           \\    / \         / \\   / \
      3   9 15  30          9   15  30          9  15  30
     /\  /\ /\  /\         /\   /\  /\         /\  /\  /\
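The key movement in case 2 can be sketched with key lists alone (the method name and representation are assumptions for this illustration; child pointers and color bookkeeping are omitted):

```java
import java.util.ArrayList;
import java.util.List;

// Hedged sketch of the fusion step: an underflowing node merges with a
// 2-node sibling, pulling the separating key down from the parent. If the
// parent empties, the same operation repeats one level up.
public class FuseDemo {
    // Removes parentKeys[sepIndex] and returns the fused node's keys.
    static List<Integer> fuse(List<Integer> parentKeys, int sepIndex,
                              List<Integer> sibling) {
        int sep = parentKeys.remove(sepIndex);
        List<Integer> fused = new ArrayList<>(sibling);
        fused.add(sep);
        fused.sort(null);
        return fused;
    }

    public static void main(String[] args) {
        // Removing 3 (as in the example): the emptied node fuses with its
        // sibling {9}, pulling 7 down; the parent then underflows and fuses
        // with {20}, pulling the root key 12 down.
        List<Integer> mid = new ArrayList<>(List.of(7));
        System.out.println(fuse(mid, 0, new ArrayList<>(List.of(9))));   // [7, 9]
        List<Integer> root = new ArrayList<>(List.of(12));
        System.out.println(fuse(root, 0, new ArrayList<>(List.of(20)))); // [12, 20]
    }
}
```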

The ideas are straightforward, but the implementation can be a bit messy. The book covers it in detail.

Synopsis

Binary search trees are a good data structure for implementing the Map ADT, but their performance is best when they are kept in balance.

The current Java documentation indicates that it uses red-black trees for TreeMap.

B-Trees

2-3-4 trees are a special case of multiway trees. A multiway tree of particular interest is the B-tree. A B-tree of order d has nodes with between d/2 (rounded up) and d children. So a 2-3-4 tree is a B-tree of order 4. The B-trees of even order are particularly nice, because insertions and deletions are handled in very similar ways to 2-3-4 trees, with splits and fusions.

The place where a B-tree is particularly useful is in storing a TreeMap of size n that is so large that it must be stored on disk. Searching through a regular binary tree stored on disk would require O(log n) comparisons, but it is quite probable that most of the time getting a child's key would require reading a new disk block. Disk reads are many orders of magnitude slower than memory accesses, and O(log n) disk reads could be time-consuming. What is especially galling is that a disk read brings in an entire disk block, usually between 512 bytes and 2048 bytes, and we use only one key of maybe 4 bytes out of that block.

An alternative is to store the TreeMap as a B-tree, whose order is the number of key/value pairs that will fit into a disk block. If the value is some sort of a reference to where the data is stored on disk, as is usually the case, then instead of getting a single key per disk read we get hundreds of them. The depth of a B-tree of order d is still O(log n), but the base of the logarithm is d instead of 2. This is a case where constants matter. Even maps with terabytes of keys can be stored in a few levels. (Note that if d is 100 and there are 10^12 keys, the number of levels of the tree will be about 6, as opposed to about 40 in a regular binary tree.)
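The arithmetic behind that last claim is easy to check. A best-case sketch: the number of levels needed for a tree with fan-out d to reach n leaves is the smallest k with d^k >= n, i.e. about log base d of n. (This ignores the minimum-occupancy rule, so it is an estimate, not an exact B-tree height formula.)

```java
// Best-case depth estimate for a tree of fan-out d holding n keys:
// the smallest k such that d^k >= n.
public class BTreeDepth {
    static int levels(long n, long d) {
        int k = 0;
        long reach = 1;                  // leaves reachable with k levels
        while (reach < n) { reach *= d; k++; }
        return k;
    }

    public static void main(String[] args) {
        long trillion = 1_000_000_000_000L;      // 10^12 keys
        System.out.println(levels(trillion, 100));  // 6 levels at fan-out 100
        System.out.println(levels(trillion, 2));    // 40 levels at fan-out 2
    }
}
```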