Activity solution - indexer coupling and cohesion
Index module
Create a wrapper around your hashtable in a file called index.c that focuses on storing counters as items in the hashtable’s set, as described in our last class. In addition to implementing a hashtable of counters, this index module should also provide functions to save an index to disk and to load it into memory from disk. A logical place to store index.c (and its header file) is in the common directory (remember to update the Makefile there). The querier in Lab 6 will also use these functions.
- Use typedef to create index_t from hashtable_t (index_t is essentially a wrapper over hashtable_t):
    typedef hashtable_t index_t;
- Create functions index_X that call the corresponding hashtable_X functions:
    index_t *index_new(const int number_slots) { return (index_t *) hashtable_new(number_slots); }
- Repeat for the other hashtable functions.
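The wrapper idea above can be sketched as follows. This is a minimal sketch, not the course library: the tiny chained hashtable is a hypothetical stand-in so the example compiles on its own, and index_find returns void * here only because no counters type is defined in this sketch. The last few lines are the actual wrapper pattern.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the course hashtable (assumption, not libcs50). */
typedef struct node { char *key; void *item; struct node *next; } node_t;
typedef struct hashtable { int num_slots; node_t **slots; } hashtable_t;

/* Simple string hash used only by this stand-in. */
static unsigned long hash(const char *s) {
    unsigned long h = 5381;
    while (*s != '\0') h = h * 33 + (unsigned char) *s++;
    return h;
}

hashtable_t *hashtable_new(const int num_slots) {
    if (num_slots <= 0) return NULL;
    hashtable_t *ht = malloc(sizeof *ht);
    if (ht == NULL) return NULL;
    ht->slots = calloc(num_slots, sizeof *ht->slots);
    if (ht->slots == NULL) { free(ht); return NULL; }
    ht->num_slots = num_slots;
    return ht;
}

void *hashtable_find(hashtable_t *ht, const char *key) {
    if (ht == NULL || key == NULL) return NULL;
    for (node_t *n = ht->slots[hash(key) % ht->num_slots]; n != NULL; n = n->next)
        if (strcmp(n->key, key) == 0) return n->item;
    return NULL;
}

/* Returns false on bad arguments, duplicate key, or allocation failure. */
bool hashtable_insert(hashtable_t *ht, const char *key, void *item) {
    if (ht == NULL || key == NULL || item == NULL) return false;
    if (hashtable_find(ht, key) != NULL) return false;
    node_t *n = malloc(sizeof *n);
    if (n == NULL) return false;
    n->key = malloc(strlen(key) + 1);
    if (n->key == NULL) { free(n); return false; }
    strcpy(n->key, key);
    n->item = item;
    unsigned long slot = hash(key) % ht->num_slots;
    n->next = ht->slots[slot];
    ht->slots[slot] = n;
    return true;
}

/* The wrapper: index_t is just another name for hashtable_t,
 * and each index_X function forwards to hashtable_X. */
typedef hashtable_t index_t;

index_t *index_new(const int number_slots) {
    return (index_t *) hashtable_new(number_slots);
}

bool index_insert(index_t *index, const char *word, void *counters) {
    return hashtable_insert(index, word, counters);
}

void *index_find(index_t *index, const char *word) {
    return hashtable_find(index, word);
}
```

Callers then use only the index_X names, so code that builds or queries the index never mentions the hashtable directly.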
Why? Because you know what the index will hold, you can provide implementation-specific helper functions for operations such as delete. With a generic hashtable, you don’t know what the table will hold.
- Create index_save(index_t *index, const char *indexFilename) to save an index to disk:
    safety checks on index and indexFilename
    open fp on indexFilename for writing (with safety check that it opened)
    iterate over hashtable with hashtable_iterate(index, fp, functionToPrintKey)
        functionToPrintKey
            print key
            iterate over each counter with counters_iterate(counter, fp, functionToPrintCounter)
                functionToPrintCounter
                    print docID and count
    close fp
  Output should be a file named indexFilename with one line for each word in the hashtable, where the word begins the line and is followed by the docID and count for each document that contains the word. This is similar to the activity from last class.
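The two nested callbacks used when saving can be sketched like this. It is a sketch under stated assumptions: the fixed-size counters_t and this counters_iterate are hypothetical stand-ins written so the example compiles by itself, and the names print_word and print_count are illustrative. The callback signatures follow the shape described above, where the FILE pointer travels through the arg parameter.

```c
#include <stdio.h>

/* Hypothetical stand-in for a counters set: a small array of (docID, count) pairs. */
typedef struct { int docID; int count; } pair_t;
typedef struct { int n; pair_t pairs[8]; } counters_t;

/* Stand-in iterator: calls itemfunc once for every (docID, count) pair. */
void counters_iterate(counters_t *ctrs, void *arg,
                      void (*itemfunc)(void *arg, const int key, const int count)) {
    if (ctrs == NULL || itemfunc == NULL) return;
    for (int i = 0; i < ctrs->n; i++)
        itemfunc(arg, ctrs->pairs[i].docID, ctrs->pairs[i].count);
}

/* Inner callback: append " docID count" to the current line. */
static void print_count(void *arg, const int docID, const int count) {
    fprintf((FILE *) arg, " %d %d", docID, count);
}

/* Outer callback: the function a hashtable iterator would call once per word.
 * Prints the word, then every docID/count pair, then ends the line. */
void print_word(void *arg, const char *word, void *item) {
    FILE *fp = (FILE *) arg;
    fprintf(fp, "%s", word);
    counters_iterate((counters_t *) item, fp, print_count);
    fputc('\n', fp);
}
```

For a word "dartmouth" that appears 3 times in document 1 and once in document 4, print_word writes the line "dartmouth 1 3 4 1".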
- Create index_load(const char *indexFilename) to load an index from disk (you may find the functions in libcs50/file.h useful):
    safety checks on indexFilename
    open fp with indexFilename for reading (with safety check that it opened)
    check for empty file (can use file_numLines(fp) to get the word count because each word is on one line)
        print error message and return NULL if file contains no words
    create new index with index_new (how many slots? wordcount/2+1?)
    read each word in file with file_readWord(fp)
        create new counters for word with counters_new()
        expect one or more docID/count pairs on the line following the word
        while line has a docID/count pair
            set the count with counters_set(counters, docID, count)
        // counters now contains one element for each docID
        add counters to hashtable with hashtable_insert(index, word, counters)
        free(word)
    fclose(fp)
    return index
  Output should be an index loaded into memory.
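Reading one line of the index file back in can be sketched with fscanf alone. This is only a sketch: read_index_line is a hypothetical helper, it stores the pairs in plain arrays where a real loader would call counters_set, and it assumes words are alphabetic, at most 99 characters long, and have no more than max pairs.

```c
#include <stdio.h>

/* Read one index line of the form "word docID count [docID count]...".
 * word must have room for 100 chars. Stores up to max pairs in docIDs/counts.
 * Returns the number of pairs read, or -1 when no more words remain. */
int read_index_line(FILE *fp, char word[], int docIDs[], int counts[], int max) {
    if (fp == NULL || fscanf(fp, "%99s", word) != 1) return -1;
    int n = 0, docID, count;
    /* %d fails on the next (alphabetic) word and leaves it unread,
     * which is what ends the loop at the end of each line. */
    while (n < max && fscanf(fp, "%d %d", &docID, &count) == 2) {
        docIDs[n] = docID;
        counts[n] = count;
        n++;
    }
    return n;
}
```

Calling this in a loop until it returns -1 visits every word in the file. In a loader built on the modules above, each pair would go into a fresh counters set that is then inserted into the index under the word.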
Use index module to implement the TSE indexer
Create an index (hashtable) with one entry for each word in the crawled webpages. Each slot in the hashtable is a set whose keys are words and whose items are counters. Each counters item has a key of docID and a count of how many times the word appeared in that document.
Save the index to disk when done.
indexer pseudo code
parse the command line, validate parameters (pageDirectory and indexFilename)
build the index in memory by processing each file (webpage) in pageDirectory
index_t *index = indexBuild(pageDirectory);
save index to file with indexFilename
index_save(index, indexFilename);
clean up data structures
indexBuild(pageDirectory)
Loop over all webpages stored by the crawler in pageDirectory and index each page by looking for each word on the page and making an entry in the index (hashtable) for each word.
create new index with index_new(number_slots)
set docID = 1
load webpage with webpage_t *page = pagedir_load(pageDirectory, docID)
while page not NULL
    indexPage(index, page, docID) // make entry for each word in this page
    delete page with webpage_delete(page)
    docID++
    page = pagedir_load(pageDirectory, docID)
return index
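The loop shape above (start at docID 1, keep going until a page fails to load) can be sketched without the webpage module. This sketch assumes the crawler saved each page in a file named pageDirectory/docID; page_path and count_pages are hypothetical helpers, and opening the file stands in for loading the page.

```c
#include <stdio.h>
#include <stdlib.h>

/* Build the string "pageDirectory/docID". Caller must free the result.
 * Returns NULL on bad arguments or allocation failure. */
char *page_path(const char *pageDirectory, const int docID) {
    if (pageDirectory == NULL || docID < 1) return NULL;
    int len = snprintf(NULL, 0, "%s/%d", pageDirectory, docID);
    char *path = malloc(len + 1);
    if (path != NULL) snprintf(path, len + 1, "%s/%d", pageDirectory, docID);
    return path;
}

/* Visit docIDs 1, 2, 3, ... until a page file cannot be opened.
 * Returns how many pages were found. */
int count_pages(const char *pageDirectory) {
    int docID = 1;
    for (;;) {
        char *path = page_path(pageDirectory, docID);
        if (path == NULL) break;
        FILE *fp = fopen(path, "r");
        free(path);
        if (fp == NULL) break;   /* no such page: the loop is done */
        /* a real indexer would load the page here and index its words */
        fclose(fp);
        docID++;                 /* without this the loop never advances */
    }
    return docID - 1;
}
```

The same structure applies when each iteration loads a webpage, indexes it, and deletes it before moving on to the next docID.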
indexPage(index, page, docID)
Find words on a single page, create a counter for each word, make entry into index (hashtable)
set pos = 0
get next word on page with webpage_getNextWord(page, &pos)
while word not NULL
    if length of word > 2
        normalizeWord(word)
        get counters for this word from index with index_find(index, word)
        if word not in index
            create new counters with counters_new()
            insert empty counters into index with index_insert(index, word, counters)
        increment word count with counters_add(counters, docID)
    free(word)
    word = webpage_getNextWord(page, &pos)
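The per-word filtering step can be sketched as two small helpers. This is one possible version, assuming that normalizing means converting to lowercase in place; is_indexable is a hypothetical name for the length check.

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Convert a word to lowercase in place. */
void normalizeWord(char *word) {
    if (word == NULL) return;
    for (char *p = word; *p != '\0'; p++)
        *p = (char) tolower((unsigned char) *p);
}

/* Only words longer than two characters are indexed. */
bool is_indexable(const char *word) {
    return word != NULL && strlen(word) > 2;
}
```

With these, "DartMouth" normalizes to "dartmouth" and is indexed, while "to" is skipped because it is too short.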