CS 33 (Spring 2011): Homework 7

CS 33 (Spring 2011): Homework 7: Extendable Hashing

Deadline: 10:00pm on May 16, 2011 (Free extension until 10:00pm on May 17, 2011)

As with the previous homework, this one is also open-ended in the sense that, rather than asking you to produce very specific output, you are asked to implement some algorithms from the textbook, experiment with them, and report results.

Implement the extendable hashing scheme from the textbook to work with the "virtual disk manager" you used in the previous homework, i.e., ~cs33/data/diskmgr.py; keep your code generic enough that it can be used with various different values of M and B, where M is the (memory) buffer size in blocks, and B is the size of a block in bytes.

Each bucket in your hash index data structure should be able to hold some fixed number of index records plus an overflow bucket pointer. An index record consists of a search key value and a corresponding record pointer, which points to the actual record on disk. (Thus, your data structure has one more level of indirection than the extendable hash file example in the textbook.) If you like, you may assume that 2 bytes suffice for a record pointer.

Your implementation should be able to work with a generic hash function that hashes keys to 32-bit values. Of course, you will be using only a prefix of this hash value, consisting of its most significant few bits. Also, allow your implementation to work with a generic value of the parameter max_splits, which controls the maximum number of times a bucket is allowed to be split during a single insertion.

[30 points]

Using your implementation, prepare a hash index of the student table, as stored on the "virtual disk" given by the file ~cs33/data/student_unsorted.dat. Have your code append the disk blocks used for the hash index at the end of the data blocks. A description of the contents of the file appears in the previous homework. Use student_id as the search key for this index. Report the following statistics as experimental results: length of hash prefix used in bucket address table, number of primary buckets, number of overflow buckets, maximum length of overflow chain, length of hash prefix used in each primary bucket.

Use the identity function as your hash function, i.e., the hash of a student ID is the ID itself (taken as a 32-bit integer value).

[20 points]

Perform several "random" searches, both successful and unsuccessful, using your hash index. Report the number of I/Os typically required.

[10 points]

Repeat the above steps and experiments using student_name as the search key instead. Use the function cs33_hash found in ~cs33/data/murmur_hash.c to hash the keys (i.e., the student names) to 32-bit values.

[20 points]

What to turn in

Using the homework submission form, turn in a single zip (or tar/gzip) archive consisting of the following files:

All of your source code. You are not forced to code in any particular language. You may also make "slight" modifications/additions to diskmgr.py if you badly need to.

A text file with your experimental observations, including output from example runs, as appropriate.

One README file with instructions for compiling (if needed) and running your program, so that we can reproduce your experimental results.