Hashing
We saw that maps and sets are very useful ADTs, allowing us to quickly find things (and potentially associated data). While binary search trees provide one efficient implementation (assuming they're balanced), and lists a simple implementation, hash tables provide a third alternative with some trade-offs.
All the code files for today: BlobHash.java; HashTest.java
Hashing
The HashSet and HashMap classes use a data structure called a hash table. The idea can be illustrated by how the Sears catalog store in West Lebanon used to keep track of catalog orders to be picked up. They had 100 pigeonholes behind the order desk, numbered 0 to 99. When a customer arrived, the clerk would ask for the last two digits of the customer's phone number. Given a number, the clerk would look in the corresponding pigeonhole for the order form. Even if there were several hundred order forms in the system, each pigeonhole contained only a few forms, and the clerk could search through these quickly. This approach effectively split a set of hundreds of items into 100 sets of a few items each. The trick was to find a hash function that would assign each order form to a particular pigeonhole in a way that spread the order forms fairly evenly among them. The last two digits of the customer's phone number worked well for this purpose. (Why would the first two digits be a bad choice?)
We can take a similar approach inside the computer. For a map, we use the key of the (key, value) pair to figure out where the pair should go; for a set we use the object itself. So we'll talk about storing keys in the hash table, possibly with an associated value. In order to implement this idea, we need two things (sketched in code right after this list):
- A table structure to store the data in. This is simply a Java array.
- A way to get from an object to a particular spot in the table. The way this is done is via a hash function, which maps a key to an index in the array. Ideally, the hash function spreads the objects evenly over the table.
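Here's a tiny sketch of those two ingredients in Java. The class name and the key below are made up for illustration, and we ignore collisions for the moment.

public class HashIdea {
    public static void main(String[] args) {
        String[] table = new String[100];     // the table: just a Java array of m = 100 slots
        String key = "moose";
        // the hash function: map the key to an index 0..99 (floorMod keeps it nonnegative)
        int index = Math.floorMod(key.hashCode(), table.length);
        table[index] = key;                   // store the key in its slot
        System.out.println(key + " hashes to slot " + index);
    }
}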
Now, if each key mapped to a distinct index in the table, we'd be done. That would be like all Sears customers having the last two digits of their phone numbers be unique. But they are not. When multiple keys map to the same table index, we have a collision. Now, you might think that with 100 slots in the table, we'd have to get close to 100 keys inserted into the table before we'd be likely to have a collision. And you would be incorrect. Have you ever heard of the Birthday Paradox?
If you randomly select people, how many do you have to select before there is at least a 50% chance that at least two of them have the same birthday?
Again, you might think that the number is close to 365. Or maybe it's around half of 365: 183? Nope. It's much smaller than that. Much, much smaller. In fact, it's just 23. In terms of hashing, if we have 365 slots in the table and randomly insert keys, once we have inserted 23 keys, we have at least a 50% chance of a collision. Of course, for a 100% chance, we'd have to insert 366 keys.
Why is this? Nice explanation by Strogatz. It boils down to the fact that for n people there are 365^n possible combinations of birthdays, but only 365*364*363*...*(365-n+1) ways in which they are all different. Take the fraction (to get how likely it is that all are different) and subtract from 1 (to get how likely it is that at least one pair is the same). Try it out for a given number: Wolfram calculator.
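If you want to check the arithmetic yourself, here is a little standalone Java program (not part of today's code files) that multiplies out the fraction and reports the first n at which the collision probability reaches 50%.

public class Birthday {
    public static void main(String[] args) {
        double allDifferent = 1.0;                        // probability that all birthdays so far differ
        for (int n = 1; n <= 365; n++) {
            allDifferent *= (365.0 - (n - 1)) / 365.0;    // the n-th person must miss the n-1 taken birthdays
            double collision = 1.0 - allDifferent;        // probability that some pair matches
            if (collision >= 0.5) {
                System.out.println("first n with at least a 50% chance: " + n);   // prints 23
                break;
            }
        }
    }
}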
"Randomly" distributed keys don't do what you might expect. If you "randomly" distribute n keys into a table of size n, you might expect to find one item in most of the slots, with very few empty slots. In fact, about n/e of the slots are empty (where e = 2.71..., the base of the natural logarithm). So over a third of the slots are expected to be empty! The expected length of the longest list is θ(log n/ log log n). (To see how to compute these take CS 30.)
Chaining
So how can we handle collisions? There are a couple of ways. The first one is called chaining. Instead of storing each element directly in the table, each slot in the table references a linked list. The linked list for slot i holds all the keys k for which h(k) = i. Here's the idea:

The keys are k1, k2, ..., k8. We show each linked list as a noncircular, doubly linked list without a sentinel; table slots without a linked list are null. Of course, we could instead make a circular, doubly linked list for each table slot and have each table slot reference one such list, even if the list is empty. In some situations, especially when we only insert into the hash table and never remove elements, singly linked lists suffice.
How many items do we expect to look at when searching for an item? For an unsuccessful search (it wasn't in the map or set), we would look at everything in the appropriate list. But how many elements is that? If the table has m slots and there are n keys stored in it, there would be n/m keys per slot on average, and hence n/m elements per list on average. We call this ratio, n/m, the load factor, and we denote it by λ. If the hash function did a perfect job and distributed the keys perfectly evenly among the slots, then each list has λ elements. In an unsuccessful search, the average number of items that we would look at is λ. For a successful search we find the element (so we always do at least 1 comparison), and along the way we look at about half of the elements in its list. This means that for a successful search we look at about 1 + λ/2 items. Either way the running time would be θ(1 + λ). Why 1 + λ? Because even if λ < 1, we have to account for the time computing the hash function h, which we assume to be constant, and for starting the search. (Of course, if λ < 1, then we cannot perfectly distribute the keys among all the slots, since a list cannot have a fraction of an element.)
Now, what if the keys are not perfectly distributed? Things get a little trickier, but we operate on the assumption of simple uniform hashing, where we assume that any given key is equally likely to hash into any of the m slots, without regard to which slot any other key hashed into. When we say "any given key," we mean any possible key, not just those that have been inserted into the hash table. For example, if the keys are strings, then simple uniform hashing says that any string—not just the strings that have been inserted—is equally likely to hash into any slot. Under the assumption of simple uniform hashing, any search, whether successful or not, takes θ(1 + λ) time on average.
Of course, the worst case is bad. It occurs when all keys hash to the same slot. It can happen, even with simple uniform hashing, but of course it's highly unlikely. But the possibility cannot be avoided. If an adversary puts n*m items into the table then one of the slots will have at least n items in it. He or she then makes those n items the data for the problem that we are dealing with, and we are stuck. (There is an idea called universal hashing, which basically computes a different hash code every time we run the program, so that data that is slow one time might be fast the next.)
Should the worst case occur, the worst-case time for an unsuccessful search is θ(n), since the entire list of n elements has to be searched. For a successful search, the worst-case time is still θ(n), because the key being searched for could be the last element in the list.
How about inserting and removing from a hash table with chaining? To insert key k, just compute h(k) and insert the key into the linked list for slot h(k), creating the linked list if necessary. That takes θ(1) time. How about removing an element? If we assume that we have already searched for it and have a reference to its linked-list node, and that the list is doubly linked, then removing takes θ(1) time. Again, that's after having paid the price for searching.
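Here is a minimal sketch of a hash set that uses chaining, pulling together the last few paragraphs: the hash function, the load factor, and search, insert, and remove. The class name is made up (it's not the course's BlobHash.java), and for brevity it uses java.util.LinkedList rather than the hand-built doubly linked lists described above.

import java.util.LinkedList;

public class ChainedHashSet<K> {
    private LinkedList<K>[] table;   // slot i holds a list of the keys k with h(k) = i
    private int n = 0;               // number of keys currently stored

    @SuppressWarnings("unchecked")
    public ChainedHashSet(int m) {
        table = (LinkedList<K>[]) new LinkedList[m];   // m slots, all initially null
    }

    // the hash function h: compress the key's hashCode into a slot index 0..m-1
    private int h(K key) {
        return Math.floorMod(key.hashCode(), table.length);
    }

    public double loadFactor() {
        return (double) n / table.length;   // lambda = n/m
    }

    // search: walk the list for slot h(key); the cost is proportional to that list's length
    public boolean contains(K key) {
        LinkedList<K> list = table[h(key)];
        return list != null && list.contains(key);
    }

    // insert: constant time once h(key) is computed (the duplicate check itself costs a search)
    public void add(K key) {
        int i = h(key);
        if (table[i] == null) table[i] = new LinkedList<>();
        if (!table[i].contains(key)) {
            table[i].add(key);
            n++;
        }
    }

    // remove: constant time after the search that finds the key in its list
    public void remove(K key) {
        LinkedList<K> list = table[h(key)];
        if (list != null && list.remove(key)) n--;
    }
}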
Note that if n gets much larger than m, then search times go up. How can we avoid this? The same way that we do for an ArrayList: when the table gets too full, we create a new one about double the size and rehash everything into the new table. What is "too full"? Java implementations typically start the table with size 11 and double the table size when λ exceeds 0.75.
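Here's what that rehashing step might look like, under the same representation as the chaining sketch above (an array of java.util.LinkedList buckets); the helper class and method names are made up. Note that every key must be rehashed, because its slot index depends on the table size.

import java.util.LinkedList;

public class Rehash {
    static <K> LinkedList<K>[] resize(LinkedList<K>[] oldTable) {
        @SuppressWarnings("unchecked")
        LinkedList<K>[] newTable = (LinkedList<K>[]) new LinkedList[2 * oldTable.length];
        for (LinkedList<K> list : oldTable) {
            if (list == null) continue;                   // empty slot: nothing to move
            for (K key : list) {
                // recompute the slot with the new table size and move the key there
                int i = Math.floorMod(key.hashCode(), newTable.length);
                if (newTable[i] == null) newTable[i] = new LinkedList<>();
                newTable[i].add(key);
            }
        }
        return newTable;
    }
}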
Everything is peachy now, right? Yes, except that this doesn't make particularly compact use of memory, in that it has lots of linked lists spread out all over the place. (This requires some overhead, doesn't take advantage of locality, etc.) In an embedded system or handheld device, we might want to avoid such complexity and waste.
Open Addressing
The second way to handle collisions is called open addressing. The idea is to store everything in the table itself, even when collisions occur. There are no linked lists.
How can we store everything in the table even when there's a collision? One simple scheme is called linear probing. Suppose we want to insert key k and that h(k) = i, but slot i is already occupied. We cannot put key k there. Instead, we probe (look at) slot i + 1. If it's empty, put key k there. If slot i + 1 is occupied, probe slot i + 2. Keep going, wrapping around to slot 0 after probing slot m − 1. As long as n < m, i.e., λ < 1, we are guaranteed to eventually find an empty slot. If the table fills, that is, if λ reaches 1, then increase the table size and rehash everything.
Searching with linear probing uses the same idea as inserting. Compute i = h(k), and search slots i, i + 1, i + 2, …, wrapping around at slot m − 1 until either we find key k or we hit an empty slot. If we hit an empty slot, then key k was not in the hash table.
Linear probing is a nice idea, but it has a couple of problems. One is how to remove a key from the hash table. We cannot just remove it from its slot. Why not? Let's take the following situation. We insert keys k1, k2, and k3, in that order, where h(k1) = h(k2) = h(k3). That is, all three keys hash to the same slot, but k1 goes into slot i, k2 goes into slot i + 1, and k3 goes into slot i + 2. Then we remove key k2, which opens up a hole in slot i + 1. Then we search for key k3. What's going to happen? We probe slot i, see that it holds k1, not k3, and then probe slot i + 1. Because slot i + 1 is now empty, we conclude that k3 is not in the hash table. Wrong! How can we correct this error? We need some way to mark a slot as having held a key that we have since removed, so that it should be treated as full during a search but as empty during insertion. So if we remove many keys, we can end up with all (or almost all) of the "empty" slots being marked, and unsuccessful searches can go on for a very long time.
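Here is a sketch of linear probing that puts the last few paragraphs together: insertion, search, and removal with a "removed" marker. The class name and the DELETED sentinel are illustrative choices rather than the course code, and a real implementation would grow and rehash the table rather than throw when it fills.

public class LinearProbingHashSet {
    private static final Object DELETED = new Object();   // marks a slot whose key was removed
    private Object[] table;
    private int n = 0;                                     // number of keys stored

    public LinearProbingHashSet(int m) {
        table = new Object[m];
    }

    private int h(Object key) {
        return Math.floorMod(key.hashCode(), table.length);
    }

    // insert: skip duplicates, then probe i, i+1, i+2, ... (wrapping) for a reusable slot
    public void add(Object key) {
        if (contains(key)) return;                         // already present
        if (n == table.length)
            throw new IllegalStateException("table full: should grow and rehash instead");
        int i = h(key);
        while (table[i] != null && table[i] != DELETED)    // pass over occupied slots
            i = (i + 1) % table.length;
        table[i] = key;                                    // empty or DELETED slot: reuse it
        n++;
    }

    // search: keep going past DELETED markers, stop at an empty slot or after m probes
    public boolean contains(Object key) {
        int i = h(key);
        for (int probes = 0; probes < table.length && table[i] != null; probes++) {
            if (table[i] != DELETED && table[i].equals(key)) return true;
            i = (i + 1) % table.length;
        }
        return false;
    }

    // remove: leave the DELETED marker rather than null, so later searches for keys that
    // probed past this slot don't stop too early and wrongly report "not found"
    public void remove(Object key) {
        int i = h(key);
        for (int probes = 0; probes < table.length && table[i] != null; probes++) {
            if (table[i] != DELETED && table[i].equals(key)) {
                table[i] = DELETED;
                n--;
                return;
            }
            i = (i + 1) % table.length;
        }
    }
}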
But there's another, more insidious, problem. It's called clustering. Long runs of occupied slots build up, increasing the average search time. Clusters arise because an empty slot preceded by t full slots gets filled next with a probability of (t + 1)/m. That's because the probability of hashing to any of the t slots is 1/m (and there are t of them), plus the next key could hash directly into that empty slot. And that's not the only way that clusters grow. Clusters can coalesce when one cluster grows and meets with another cluster.
There are various schemes that help a little. One is double hashing, where we take two different hash functions h1 and h2 and make a third hash function h′ from them: h′(k, p) = (h1(k) + p·h2(k)) mod m. The first probe goes to slot h1(k), and each successive probe is offset from the previous probe by the amount h2(k), all taken modulo m. Now, unlike linear or quadratic probing, the probe sequence depends in two ways upon the key, and as long as we design h1 and h2 so that if h1(k1) = h1(k2) it's really unlikely that h2(k1) = h2(k2), we can avoid clustering. Again, we must choose our hash functions h1 and h2 carefully.
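Here's a sketch of the double-hashing probe sequence for integer keys. The particular h1 and h2 below are placeholder choices, not prescribed by these notes; in particular the "1 + (k mod (m − 1))" form for h2 is just one common way to keep h2(k) from being 0, which would stall the probe sequence. We also assume m is at least 3.

public class DoubleHashing {
    // placeholder primary hash function
    static int h1(int key, int m) { return Math.floorMod(key, m); }

    // placeholder secondary hash function; the "+1" keeps the offset nonzero
    static int h2(int key, int m) { return 1 + Math.floorMod(key, m - 1); }

    // slot examined on the p-th probe for this key: h'(k, p) = (h1(k) + p*h2(k)) mod m
    static int probe(int key, int p, int m) {
        return Math.floorMod(h1(key, m) + p * h2(key, m), m);
    }

    public static void main(String[] args) {
        int m = 11, key = 42;
        for (int p = 0; p < 5; p++)                       // print the first few probes for one key
            System.out.println("probe " + p + ": slot " + probe(key, p, m));
    }
}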
When we analyze open addressing, we use the assumption of uniform hashing, which says that the probe sequence hʹ(k, 0), hʹ(k, 1), hʹ(k, 2), …, hʹ(k, m − 1) used when searching or inserting is equally likely to be any permutation of the slot numbers 0, 1, 2, ..., m − 1. In other words, each possible probe sequence is equally likely.
The result is that the average number of probes in an unsuccessful search or when inserting is approximately 1/(1−λ). This number gets very large as λ approaches 1. If the hash table is 90% full, then about 10 probes are needed. The average number of probes for a successful search isn't so hot, either: (1/λ) ln(1/(1−λ)). That's better than the number of probes for an unsuccessful search, but when the hash table is 90% full it still comes to a little over 2.5 probes. So here we definitely need to keep the table relatively sparsely populated.
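To get a feel for these formulas, here's a quick evaluation at a few load factors (a standalone sketch); notice how fast the unsuccessful-search cost blows up as λ approaches 1.

public class ProbeCounts {
    public static void main(String[] args) {
        double[] lambdas = {0.5, 0.75, 0.9, 0.99};
        for (double lambda : lambdas) {
            double unsuccessful = 1.0 / (1.0 - lambda);                            // ~ probes, unsuccessful search or insert
            double successful = (1.0 / lambda) * Math.log(1.0 / (1.0 - lambda));   // ~ probes, successful search
            System.out.printf("lambda = %.2f: unsuccessful ~ %.1f, successful ~ %.2f%n",
                              lambda, unsuccessful, successful);
        }
    }
}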
Computing Hash Functions
The remaining problem is how to compute a hash function anyway, mapping from keys to spots in a table. A good hash function has three desirable properties:
- It can be computed quickly.
- It spreads the universe of keys fairly evenly over the table.
- Small changes in the key (e.g. changing a character in a string or changing the order of the letters) should usually result in a different hash value.
Java computes hash functions in two steps. The first is to have every object have a hashCode method, which returns an int. The default is often the address of the object in memory. Then we have to map this integer into an entry in the table. This mapping is called the compression function.
A simple compression function is to take the hash code mod the table size. This works well when the table size is a prime, but not so well otherwise. In that case the book suggests computing [(a·i + b) mod p] mod N, where i is the hash code, p is a prime > N, and a and b are chosen at random between 1 and p−1 (b can also be 0). But this is taken care of in the implementation of Java's library.
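Here's what that compression might look like in code. The class name and the particular prime p below are arbitrary choices for illustration; a and b are drawn at random once, as the book describes.

import java.util.Random;

public class Compression {
    static final int p = 1_000_000_007;                // a prime bigger than any table size we expect
    static final Random rand = new Random();
    static final long a = 1 + rand.nextInt(p - 1);     // random scale factor in 1..p-1
    static final long b = rand.nextInt(p);             // random shift in 0..p-1 (b = 0 is allowed)

    // map a hash code i into a slot 0..N-1 of a table of size N
    static int compress(int i, int N) {
        long y = Math.floorMod(a * i + b, (long) p);   // (a*i + b) mod p, kept nonnegative
        return (int) (y % N);                          // ... then mod N
    }
}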
Java gives us the responsibility to come up with a good hashCode. In particular, if we override equals for an object, it is very important that we override hashCode so that two items considered equal have the same hash code. This is because if we insert an item into the table and then look for it using something that is equal to the item, we expect to find it. If the hash codes are different, we won't find it, since we will be looking in the wrong bucket.
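For example, here is a made-up Point class that overrides both methods consistently, so that two equal points always land in the same bucket.

import java.util.Objects;

public class Point {
    private final int x, y;

    public Point(int x, int y) { this.x = x; this.y = y; }

    @Override
    public boolean equals(Object other) {
        if (!(other instanceof Point)) return false;
        Point p = (Point) other;
        return x == p.x && y == p.y;     // two points are equal if their coordinates match
    }

    @Override
    public int hashCode() {
        return Objects.hash(x, y);       // equal points now produce equal hash codes
    }
}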
So how can we compute hash codes, particularly for composite objects (those with several instance variables, or strings, or arrays, ...)? A simple option is to sum all of the pieces, or sum the hashcodes of the pieces. However, this means that shuffling the order (say of a string) does not change the hashcode. It probably should! A better idea for computing the hash code of what we can view as a tuple (x0, x1, ..., x(k−1)) is to pick a positive number a (the book suggests that 33, 37, 39, and 41 are good choices) and compute the polynomial x0·a^(k−1) + x1·a^(k−2) + ... + x(k−2)·a + x(k−1). The book shows how to do this using Horner's rule, which basically means starting with a running total at x0 and, for j = 1 to k−1, multiplying the total by a and adding xj.
// x is assumed to be the array holding the pieces of the key (e.g., a string's characters)
@Override
public int hashCode() {
    final int a = 37;                  // the book suggests 33, 37, 39, or 41 as good multipliers
    int sum = x[0];                    // Horner's rule: start the running total at x[0]
    for (int j = 1; j < x.length; j++)
        sum = a * sum + x[j];          // multiply by a, then add the next piece
    return sum;
}
The book also shows a way to use cyclic shifts for this purpose. For the curious, there is a lot of literature on hashing, including Knuth's Art of Computer Programming, Vol. 3.