CS 33 (Spring 2011): Homework 8

CS 33 (Spring 2011): Homework 8: Join Evaluation

Deadline: 10:00pm on May 28, 2011

This final homework is definitely the most intricate, so you have several extra days to work on it. Taken together with Homeworks 6 and 7, this constitutes a mini-project, implementing some of the innards of a database system.

Yet again, you are asked not for very specific output, but to implement some algorithms from the textbook, experiment with them, and report results.

Part I: Preparing a Disk Image from an SQL Script

Prepare a disk image univ.dat to be accessed through now-familiar "virtual disk manager" in

~cs33/data/diskmgr.py

This disk image is to contain the data in a number of tables in a simplified university database, specifically, department, course, instructor, teaches, student, and takes. The schemata and data for these tables can be found in ~cs33/data/univ.sql, which is an SQL script.

The disk image should also contain metadata: the names of the tables and indices; for each table, its schema and a pointer to the start of the table's data; for each index (see below for more about indices), the name of the table and attribute (column) that the index is on and a pointer to the start of the index itself. I don't want to be too rigid about the format of this metadata, but here is one suggestion (assuming a table name fits in 14 bytes). Starting at block 0 of the disk image, we see:

      +-----------------------------------+
      | header_length_in_blocks (2 bytes) |
      +-------------------------+---------+---------------+
      | table_ptr_1 (2 bytes)   | table_name_1 (14 bytes) |
      +-------------------------+-------------------------+
      | table_ptr_2 (2 bytes)   | table_name_2 (14 bytes) |
      +-------------------------+-------------------------+
      |        . . .                      . . .           |
      |        . . .                      . . .           |
      |        . . .                      . . .           |
      +-------------------------+-------------------------+
      | table_ptr_n (2 bytes)   | table_name_n (14 bytes) |
      +-------------------------+---------------+---------+
      | (null padding to                                  |
      | <header_length> blocks)                           |
      +---------------------------------------------------+

Then at block number <table_ptr_1> of the disk image, we see the following information about Table 1:

      +---------------------------------+
      | num_columns (1 byte) (= k, say) |
      +-----------------------+---------+------------+--------------------------------+--------------------------------+
      | col_1_name (14 bytes) | col_1_type (2 bytes) | col_1_tree_index_ptr (2 bytes) | col_1_hash_index_ptr (2 bytes) |
      +-----------------------+----------------------+--------------------------------+--------------------------------+
      | col_2_name (14 bytes) | col_2_type (2 bytes) | col_2_tree_index_ptr (2 bytes) | col_2_hash_index_ptr (2 bytes) |
      +-----------------------+----------------------+--------------------------------+--------------------------------+
      |        . . .                   . . .                      . . .                            . . .               |
      |        . . .                   . . .                      . . .                            . . .               |
      |        . . .                   . . .                      . . .                            . . .               |
      +-----------------------+----------------------+--------------------------------+--------------------------------+
      | col_k_name (14 bytes) | col_k_type (2 bytes) | col_k_tree_index_ptr (2 bytes) | col_k_hash_index_ptr (2 bytes) |
      +-----------------------+----------------------+--------------------------------+--------------------------------+

Assume that the type of a column is always one of the following:

tinyint unsigned (requires 1 byte to store),
smallint unsigned (requires 2 bytes),
integer unsigned (requires 4 bytes), and
varchar(N) for some length 0 ≤ N ≤ 255 (requires N bytes to store, with null padding).

Use an appropriate scheme to represent this type information within 2 bytes. The index pointers are, as the name suggests, pointers to the appropriate types of indices on the corresponding column. When an index is not present, it is indicated by having the pointer field set to zero.

For this first part of the homework, simply create the disk image from the SQL file so that it holds the table data and metadata only. Do not create any index yet. Have each table's rows appear in the same order as in the SQL file. Write the disk image to univ.dat. I don't particularly care how you generate the disk image, so long as you document how you did it, and you certainly don't need to parse the SQL script. Feel free to hardwire the data from the SQL script into your code. Also, note that you don't need to use the disk manager at this stage.

[15 points]

Part II: Creating Indices

Using your code from Homework 6 and Homework 7, create both B⁺-tree and hash indices for the following relations, on the given attributes:
- course, on course_id
- instrcutor, on i_id
- student, on s_id
- teaches, on i_id and course_id
- takes, on s_id and course_id
You may choose any reasonable hash function for your hash indices; I leave this up to you.

At this stage, you will be using the disk manager. Report the I/O cost for each index creation as part of your experimental results. I haven't stipulated the values for M and B, where M is the (memory) buffer size in blocks, and B is the size of a block in bytes. Choose values that give you interesting results for this homework; the default values in diskmgr.py are a start.

Let each index get appended to the disk image. Remember to update your metadata (index pointers) correctly.

[15 points]

Part III: Computing Joins

Finally, write code for each of the following equi-join algorithms. You will only have to join two tables at a time, and the join condition will always be equality on a single attribute. That is, even though some pairs of relations may have two or more attributes in common, for simplicity you will be asked to join on one attribute only.

Blocked nested-loops join.

[20 points]

Indexed nested-loops join (the index may be either a B⁺-tree index or a hash index).

[20 points]

Sort-merge-join. For this one, if it helps you keep things simple, you may make M large enough so that not too many passes are required to sort. But don't make it so large that every table simply fits in memory!

[30 points]

Have your code read three command-line arguments: the names of the two relations to join, and the attribute to join on. For instance, if your joining program is written in Python and called join.py, we may invoke it as follows:

python join.py student takes s_id

The program should then compute the join using each of the above algorithms (using each of the available indices, if appropriate) and display the result and the I/O cost (determined by calling get_io_count in diskmgr.py) in each case. Of course, the results should be the same in each case, up to reordering of rows!

What to turn in

Using the homework submission form, turn in a single zip (or tar/gzip) archive consisting of the following files:

All of your source code. You'll have to write at least three programs, one for each part.

A text file with your experimental observations, including output from example runs, as appropriate. You may want to submit two versions of the disk image file: before and after creation of indices in Part II.

One README file with instructions for compiling (if needed) and running your program, so that we can reproduce your experimental results.