CS 33 (Spring 2011): Homework 6: B+-tree Indexing
Deadline: 10:00pm on May 9, 2011 (Free extension until 10:00pm on May 11, 2011)
This homework is more open-ended than previous ones.
Rather than asking you to produce very specific output, it asks
you to implement some algorithms from the textbook, experiment
with them, and report results.
Creating, Querying, and Updating a B+-tree
You are asked to produce working implementations of the following
two algorithms from the textbook.
- The B+-tree querying algorithm in Figure 11.11
(page 489).
[20 points]
- The B+-tree single-item insertion algorithm in
Figure 11.15 (page 494).
[20 points]
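To fix ideas, here is one possible skeleton of the querying algorithm. The dictionary-based node representation and the load_node helper are placeholders only; in your program, fetching a node must go through the virtual disk manager described next. The insertion algorithm follows the same descent to a leaf, but must additionally handle node splits.

    # Sketch of the Figure 11.11 query, over a placeholder node format.
    # A node is assumed to be a dict with:
    #   "is_leaf"  : True for leaves, False for internal nodes
    #   "keys"     : sorted list of search-key values
    #   "pointers" : child block numbers (internal) or record pointers (leaf)
    # load_node(ptr) is a hypothetical helper that returns such a dict;
    # in your code it would read and decode a block via the disk manager.

    def find(root_ptr, v, load_node):
        node = load_node(root_ptr)
        while not node["is_leaf"]:
            # Follow the child whose key range contains v.
            i = 0
            while i < len(node["keys"]) and v >= node["keys"][i]:
                i += 1
            node = load_node(node["pointers"][i])
        for i, key in enumerate(node["keys"]):
            if key == v:
                return node["pointers"][i]   # pointer to the data record
        return None                          # unsuccessful search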
Instead of performing actual disk I/O, you should make calls to
methods of a "virtual disk manager" that will move data between a
"virtual disk" (implemented as a binary file) and a "virtual DB
buffer" (implemented as a list/array of M bytearrays, each bytearray
being a chunk of B bytes, where M is the buffer size in blocks, and
B is the size of a block in bytes). Such a virtual disk manager has
been implemented for you in Python, in
~cs33/data/diskmgr.py
and you should study that first. Note especially the methods
- read_block, write_block, append_block:
for performing I/Os,
- get_bytes, set_bytes: for working with
blocks in the buffer.
Your code should be disciplined and use only these methods for
working with the data. Of course, you are free to manipulate the
data in one or two blocks by storing them in temporary variables of
your own, but it is important that your code not "cheat",
i.e., you should not use any large storage area of your own to get
around the memory limitation modeled by the limited size of the
buffer.
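For example, a disciplined read/modify/write cycle might look like the sketch below. The method names are the ones listed above, but the argument orders and return values shown are only guesses; consult diskmgr.py for the real signatures.

    def increment_byte(dm, block_no, offset):
        # Bring the block into the virtual DB buffer (one virtual read I/O).
        dm.read_block(block_no)
        # Inspect and modify bytes strictly through the buffer methods.
        old = dm.get_bytes(block_no, offset, 1)              # assumed signature
        dm.set_bytes(block_no, offset, bytes([old[0] + 1]))  # assumed signature
        # Flush the block back to the virtual disk (one virtual write I/O).
        dm.write_block(block_no)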
Interpreting the data in a block and organizing the blocks on
disk are both left up to you. Also, if you need more sophisticated buffer
management or replacement policies, you should implement them on
your own; the code in ~cs33/data/diskmgr.py implements only
an LRU policy, though it does have methods for pinning, unpinning,
and tossing a block.
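To illustrate the kind of design decision the block-layout choice involves, below is one possible way to lay out a B+-tree node inside a B-byte block, using Python's struct module with fixed-width keys. Nothing about this layout is required; it is only a sketch of the sort of encoding you will have to design yourself.

    import struct

    # One possible node layout, assuming B = 200 and 10-byte string keys:
    #   byte 0        : 1 if leaf, 0 if internal
    #   byte 1        : number of keys currently stored, n
    #   bytes 2..     : n (key, pointer) pairs, each packed as "<10sI"
    #   last 4 bytes  : for a leaf, block number of the right sibling;
    #                   for an internal node, the (n+1)-th child pointer
    KEY_FMT = "<10sI"                     # 10-byte key + 4-byte pointer
    KEY_SIZE = struct.calcsize(KEY_FMT)   # 14 bytes per entry

    def pack_node(is_leaf, keys, pointers, last_ptr, block_size=200):
        buf = bytearray(block_size)
        buf[0] = 1 if is_leaf else 0
        buf[1] = len(keys)
        for i, (k, p) in enumerate(zip(keys, pointers)):
            struct.pack_into(KEY_FMT, buf, 2 + i * KEY_SIZE,
                             k.ljust(10).encode(), p)
        struct.pack_into("<I", buf, block_size - 4, last_ptr)
        return bytes(buf)

    def unpack_node(buf, block_size=200):
        is_leaf = buf[0] == 1
        n = buf[1]
        keys, pointers = [], []
        for i in range(n):
            k, p = struct.unpack_from(KEY_FMT, buf, 2 + i * KEY_SIZE)
            keys.append(k.decode().rstrip())
            pointers.append(p)
        (last_ptr,) = struct.unpack_from("<I", buf, block_size - 4)
        return is_leaf, keys, pointers, last_ptr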
Experiments to Perform
Having done the above, write code to perform the following
experiment. Suppose you start out with a relation student (as
in our university database) stored on disk. The file
~cs33/data/students.sql
contains sample data for this relation. The binary file
~cs33/data/student_unsorted.dat contains the same data,
stored in the same order, in 50-byte records, structured as
indicated in the comments for the
demo function in
diskmgr.py.
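As a starting point, a per-block record decoder might look like the sketch below. The real 50-byte layout is the one documented in the demo comments; the "<10s20s16sI" format (ID, name, dept_name, tot_cred) used here is just a placeholder, as are the assumed read_block/get_bytes signatures.

    import struct

    # Hypothetical decoder for the 50-byte student records in one data block.
    # Replace RECORD_FMT with the actual field widths documented in the
    # comments of the demo function in diskmgr.py.
    RECORD_FMT = "<10s20s16sI"                  # ID, name, dept_name, tot_cred
    RECORD_SIZE = struct.calcsize(RECORD_FMT)   # 50 bytes under this assumption

    def _text(field):
        # Strip trailing padding (spaces or NULs) from a fixed-width field.
        return field.decode().rstrip("\x00 ")

    def records_in_block(dm, block_no, block_size=200):
        dm.read_block(block_no)                 # virtual I/O into the buffer
        for off in range(0, block_size - RECORD_SIZE + 1, RECORD_SIZE):
            raw = dm.get_bytes(block_no, off, RECORD_SIZE)   # assumed signature
            sid, name, dept, tot_cred = struct.unpack(RECORD_FMT, raw)
            yield _text(sid), _text(name), _text(dept), tot_cred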
- Use repeated single-item insertions to create a
B+-tree index for this relation, with student_id
as the search key, stored on disk in blocks immediately following
the data blocks. Count the total number of I/Os performed; use the
provided method get_io_count in diskmgr.py. (A sketch of such a
driver appears after this list.)
[10 points]
- Redo the above, but with the disk initially containing the
same data sorted on student_id. A sorted version of the
virtual disk is in ~cs33/data/student_sorted.dat. Do you
notice any significant difference in the I/O count? Try this with
the parameters M = 10, B = 200 (as in the demo function
in diskmgr.py). Also try it with other parameter settings
if you like.
[10 points]
- Perform several "random" searches, both successful and
unsuccessful, using your B+-tree index. Report the
number of I/Os typically required. Is there a difference between
the two cases, i.e., whether the index was created from a sorted
data file or not?
[10 points]
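A driver for the first and third experiments might be organized along the following lines. Here bptree_insert and bptree_find stand for your own implementations of Figures 11.15 and 11.11, and records_in_block and num_data_blocks stand for however you iterate over the student file; only get_io_count is an actual diskmgr.py method.

    import random

    def build_index(dm, num_data_blocks):
        # Experiment 1: repeated single-item insertion, counting virtual I/Os.
        start = dm.get_io_count()
        for blk in range(num_data_blocks):
            for slot, (sid, name, dept, tot) in enumerate(records_in_block(dm, blk)):
                bptree_insert(dm, key=sid, record_ptr=(blk, slot))  # Figure 11.15
        print("I/Os for index creation:", dm.get_io_count() - start)

    def run_searches(dm, known_ids, trials=20):
        # Experiment 3: random successful and unsuccessful searches.
        for _ in range(trials):
            successful = random.random() < 0.5
            key = random.choice(known_ids) if successful else "no-such-id"
            before = dm.get_io_count()
            result = bptree_find(dm, key)                           # Figure 11.11
            print(key, "found" if result is not None else "absent",
                  "I/Os:", dm.get_io_count() - before)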
What to turn in
Using the homework
submission form, turn in a single zip (or tar/gzip) archive
consisting of the following files:
- All of your source code. You are not forced to code in Python,
but if you use a different language, you must first translate
diskmgr.py to its moral equivalent in your language. You
may also make "slight" modifications/additions to
diskmgr.py if you badly need to.
- A text file with your experimental observations, including
output from example runs, as appropriate.
- One README file with instructions for compiling (if needed)
and running your program, so that we can reproduce your
experimental results.