Index module

Create a wrapper around your hashtable in a file called index.c that stores counters as the items in the hashtable's sets, as described in our last class. In addition to implementing a hashtable of counters, this index module should also provide functions to save an index to disk and to load it from disk into memory. A logical place to store index.c (and its header file) is in the common directory (remember to update the Makefile). The querier in Lab 6 will also use these functions.

  1. use typedef to create index_t from hashtable_t (index_t is essentially a wrapper over hashtable_t)

    typedef hashtable_t index_t;

  2. create functions index_X that call corresponding hashtable_X functions

    Example:

    index_t *index_new(int number_slots)
    {
      return (index_t *) hashtable_new(number_slots);
    }
    

    Repeat for other hashtable functions

    Why? Because you know what the index will hold, you can supply implementation-specific helper functions (such as the itemfunc arguments to delete and print); with a generic hashtable, you don’t know what the table holds.

  3. create index_save(index_t *index, const char *indexFilename) to save an index to disk
    safety checks on index and indexFilename
    open fp on indexFilename for writing (with safety check that it opened)
    iterate over hashtable with hashtable_iterate(index, fp, functionToPrintKey) 
       
    functionToPrintKey
        print key
        iterate over each counter with counters_iterate(counter, fp, functionToPrintCounter)
              
    functionToPrintCounter
        print count
    

    Output should be a file named indexFilename with one line for each word in the hashtable, where the word begins the line and is followed by docID/count pairs, one pair for each document that contains the word. This is similar to the activity from last class.

  4. create index_load(const char *indexFilename) to load an index from disk (you may find functions in libcs50/file.h helpful)

    Example:

    safety checks on indexFilename
    open fp with indexFilename for reading (with safety check that it opened)
    check for empty file (can use file_numlines(fp) to get word count because each word is on one line)
     print error message and return NULL if file contains no words
    create new index with index_new (how many slots? wordcount/2+1?)
    read each word in file with file_readWord(fp) 
     create new counter for word with counters_new()
     expect one or more docID/count pairs on line following word
     while line has docID/count pair
         set the count for this docID with counters_set(counters, docID, count)
     //counters now contains one element for each docID
     add counters to hashtable with hashtable_insert(index, word, counters)
     free(word)
    fclose(fp)
    return index
    

Output should be an index loaded into memory.

Use the index module to implement the TSE indexer

Create an index (hashtable) with one entry for each word found in the crawled webpages. Each entry pairs a word (the key) with a counters item. Within that counters, each docID is a key whose count records how many times the word appeared in that document.

Save the index to disk when done.

indexer pseudo code

parse the command line, validate parameters (pageDirectory and indexFilename)
build the index in memory by processing each file (webpage) in pageDirectory
    index_t *index = indexBuild(pageDirectory);
save index to file with indexFilename
    index_save(index, indexFilename);

clean up data structures
    index_delete(index);

indexBuild(pageDirectory)

Loop over all webpages stored by crawler in pageDirectory, index those pages by looking for each word on each page, make entry in index (hashtable) for each word.

create new index with index_new(number_slots);
set docID = 1;
load webpage with webpage_t *page = pagedir_load(pageDirectory, docID)
while page not NULL
   indexPage(index, page, docID) //make entry for each word in this page
   webpage_delete(page)
   docID++
   page = pagedir_load(pageDirectory, docID)
return index

indexPage(index, page, docID)

Find words on a single page, create a counter for each word, make entry into index (hashtable)

get next word on page with webpage_getNextWord()
while word not NULL
    if word length > 2
        normalizeWord(word)
        get counters for this word from index with index_find(index, word)
        if word not in index
            create new counters with counters_new()
            insert empty counters into index with index_insert(index, word, counters)
        increment word count with counters_add(counters, docID)
    free(word)
    word = webpage_getNextWord()