CS 33 (Spring 2010): Homework 3

CS 33 (Spring 2010): Homework 7: Extendable Hashing

Please read through this page entirely and carefully, before beginning your work.

There is just one problem in this homework. This is because you are expected to be working on the course project as well.

Implement the extendable hashing scheme from the textbook, using buckets that can hold up to 3 records each. Use your implementation to build a hash index for the data in ~cs33/data/film.txt on the film_name attribute, reading in the records one by one, in the order in which they appear in the file. The single quotes are not part of the film name. Each record in each bucket should be a 256-byte string and should hold the corresponding line of the data file verbatim (quotes, commas and terminating newline included). You may assume that each line of the data file is at most 255 characters long.
The grading of your implementation will take into account the clarity and beauty of your code, in addition to its correctness.
[30 points]

Use the hash function cs33_hash found in ~cs33/data/murmur_hash.c to hash the keys (i.e., the film names) to 32-bit values. Of course, you will only be using a prefix of this hash value, consisting of its most significant few bits.

Your program should read exactly one argument from the command line (report an error if the number of arguments is not one) and then behave in one of the following ways, depending on the argument type.

If the argument is a positive integer, n: Immediately after inserting the nth record in the hash table (or the last record, if the file contains fewer than n records in total), display the current state of the table (both the bucket address table and the contents of all allocated buckets) on standard output. You should write a suitable pretty-printing procedure/function for this purpose. Display the pointers in the bucket address table in "%p" format (in printf notation). When displaying the contents of a bucket, simply print out each record as a string. Continue processing the file, up to the end. After processing the entire file, print (on standard output) the hash prefix length used in the bucket address table and the total number of buckets created to hold the records. Use this format:
"Final hash prefix length = %d\nNumber of buckets = %d\n"
[25 points]

If the argument is any string other than a positive integer: Treat the argument as the name of a film. Process the entire file silently, then use your hash table to search for the film names given on the command line. Print (on standard output) the entire contents of the bucket containing the record for this film name (as specified above). If the film name is not found, print an appropriate error message. Note: Since the film names have spaces in them, you will have to invoke the program in one of the following ways:
$ progname MUPPET\ MILE
$ progname 'MUPPET MILE'

[20 points]

You must program in a language where you can control memory allocation yourself: C and C++ are best for this (Java works too, though you'll have to translate murmur_hash.c yourself, first). Perl and Python are not too good for this. If you wish to use a language other than C/C++/Java, get clearance from the course staff first.

Write a single program with all of the above functionality and turn in its source code. You may use tar/gzip or zip if you have multiple source files.

Provide instructions to compile and run your program on "sunapee", either as comments on top of the source code, on in a separate file with a ".txt" extension. If you wish, you may provide a Makefile.

Submit your files (there's a limit of 3) using the homework submission form. Keep in mind: this homework is numbered "Homework 7".