CS 33 (Spring 2011): Homework 6: B+-tree Indexing

Deadline: 10:00pm on May 9, 2011 (Free extension until 10:00pm on May 11, 2011)

This homework is more open-ended than previous ones. Rather than asking you to produce very specific output, it asks you to implement some algorithms from the textbook, experiment with them, and report results.

Creating, Querying, and Updating a B+-tree

You are asked to produce working implementations of the following two algorithms from the textbook.

  1. The B+-tree querying algorithm in Figure 11.11 (page 489).
    [20 points]

  2. The B+-tree single-item insertion algorithm in Figure 11.15 (page 494).
    [20 points]

Instead of performing actual disk I/O, you should make calls to methods of a "virtual disk manager" that will move data between a "virtual disk" (implemented as a binary file) and a "virtual DB buffer" (implemented as a list/array of M bytearrays, each bytearray being a chunk of B bytes, where M is the buffer size in blocks, and B is the size of a block in bytes). Such a virtual disk manager has been implemented for you in Python, in

 ~cs33/data/diskmgr.py 
and you should study that first. Note especially the methods

Your code should be disciplined and use only these methods for working with the data. Of course, you are free to manipulate the data in one or two blocks by storing them in temporary variables of your own, but it is important that your code not "cheat", i.e., you should not use any large storage area of your own to get around the memory limitation modeled by the limited size of the buffer.

Interpreting the data in a block and organizing the blocks on disk are both left up to you. Also, if you need sophisticated buffer management and replacement policies, you should implement that on your own; the code in ~cs33/data/diskmgr.py implements only an LRU policy, though it does have methods for pinning, unpinning, and tossing a block.

Experiments to Perform

Having done the above, write code to perform the following experiment. Suppose you start out with a relation student (as in our university database) stored on disk. The file

 ~cs33/data/students.sql 
contains sample data for this relation. The binary file ~cs33/data/student_unsorted.dat contains the same data, stored in the same order, in 50-byte records, structured as indicated in the comments for the demo function in diskmgr.py.

  1. Use repeated single-item insertions to create a B+-tree index for this relation, with student_id as the search key, stored on disk in blocks immediately following the data blocks. Count the total number of I/Os performed; use the provided method get_io_count in diskmgr.py.
    [10 points]

  2. Redo the above, but with the disk initially containing the same data sorted on student_id. A sorted version of the virtual disk is in ~cs33/data/student_sorted.dat. Do you notice any significant difference in the I/O count? Try this with the parameters M = 10, B = 200 (as in the demo function in diskmgr.py). Also try it with other parameter settings if you like.
    [10 points]

  3. Perform several "random" searches, both successful and unsuccessful, using your B+-tree index. Report the number of I/Os typically required. Is there a difference between the two cases, i.e., whether the index was created from a sorted data file or not?
    [10 points]

What to turn in

Using the homework submission form, turn in a single zip (or tar/gzip) archive consisting of the following files: