Due Wednesday May 18, at 10pm

In this lab you’ll continue the Tiny Search Engine (TSE) by coding the Querier according to the Requirements Spec in the starter kit.

Grading will focus on CS50 coding style - including consistent formatting, selection of identifier names, and use of meaningful comments - in addition to correctness, testing, and documentation.

Your C code must compile without producing any compiler warnings. You will lose points if the compiler produces warnings when using our CS50-standard compiler flags.

If your submitted code fails to compile, or triggers a segmentation fault, you will fail some or all of our correctness tests. Write defensive code: each function should check its pointer parameters for NULL and take some appropriate (safe) action. Write solid unit tests and test drivers, and use regression testing as your development proceeds.

If valgrind reports memory errors or memory leaks when querier exits normally, you will lose points.

Preparation

  1. Start with the same repository you used for Lab 5. Before you begin, make sure you submitted Lab 5 correctly, and pushed the tag to your GitHub remote.

  2. Run make clean to ensure your local repo is clean, and confirm with git status that everything has been committed and pushed.

  3. Do not proceed if you have uncommitted changes or unpushed commits. Seek help if you need to sort out your repo or GitHub.

Assignment

Write the third sub-system of the Tiny Search Engine, the querier. Your design and implementation must follow the Querier Requirements Spec (aka “the Specs”), and make good use of our abstract data structures.

As you create files and directories, add them to your git repository. Important: do not add any compiled files to git. In short: your repo should not contain any data files, editor backup files, core dumps, or files built with make and removed by make clean.

Commit and push at least once a day.

In your tse-xxxx directory,

  1. Review the README.md file; add any top-level comments you need to communicate to the grader.

  2. Review Makefile; you may need to uncomment the lines for querier.

In the querier subdirectory,

  1. Write a file DESIGN.md to provide the Design Spec for querier, drawing inspiration from the lecture notes.

  2. Write a file IMPLEMENTATION.md to provide the implementation spec and testing plan for querier. See lecture notes about what should be in an implementation spec.

  3. Write a README.md file to provide any special information about how to compile or test your querier, any assumptions you made while writing the querier, and an indication of how much of the functionality you implemented, as described below.

  4. Write a program querier.c according to the Specs and your Design.

  5. Your program shall leverage common code from common.a in the ../common directory:
    • a module index.c that contains all the logic for saving and loading index files (this module is common between the Indexer, Querier, and indextest);
    • a module pagedir.c that contains all the logic for saving pages to a crawler output directory and for loading pages from a crawler output directory (this module is common between the Crawler and Indexer);
    • a module word.c that implements NormalizeWord (this module is common between the Indexer and Querier);
    • a Makefile that builds common.a for use by the programs;
    • a short README.md file explaining how to build common.a;
    • a .gitignore file telling Git to ignore the binary files (object files) in this directory.
  6. Write a Makefile that will, by default, build the querier executable program.
  7. Add a make clean target that removes files produced by Make or your tests.
  8. Add a make test target that tests your querier.
  9. Write a testing.sh bash script that can be run by make test. This script must include good comments describing your tests. For best results, make test should run bash -v testing.sh.
  10. Save the output of your tests with make test &> testing.out.

  11. Write a .gitignore file telling Git to ignore the binary files (object files) and other unnecessary files in this directory.

Your submitted repo should now include the following files, plus any others you created for testing:

├── .gitignore
├── Makefile
├── README.md
├── common			# added by you (Lab4)
│   ├── .gitignore
│   ├── Makefile
│   ├── README.md
│   ├── index.c		# added by you (Lab5)
│   ├── index.h		# added by you (Lab5)
│   ├── pagedir.c
│   ├── pagedir.h
│   ├── word.c		# added by you (Lab5)
│   └── word.h		# added by you (Lab5)
├── crawler			# added by you (Lab4) except REQUIREMENTS, DESIGN
│   ├── .gitignore
│   ├── REQUIREMENTS.md
│   ├── DESIGN.md
│   ├── IMPLEMENTATION.md
│   ├── Makefile
│   ├── README.md
│   ├── testing.sh
│   ├── testing.out
│   └── crawler.c
├── indexer			# added by you (Lab5) except REQUIREMENTS
│   ├── .gitignore
│   ├── REQUIREMENTS.md
│   ├── DESIGN.md
│   ├── IMPLEMENTATION.md
│   ├── Makefile
│   ├── README.md
│   ├── indexer.c
│   ├── indextest.c
│   ├── testing.sh
│   └── testing.out
├── libcs50
│   ├── .gitignore
│   ├── Makefile
│   ├── README.md
│   ├── bag.c
│   ├── bag.h
│   ├── counters.c	# if you decided to add your Lab3 solution
│   ├── counters.h
│   ├── file.c
│   ├── file.h
│   ├── file.md
│   ├── hashtable.c	# if you decided to add your Lab3 solution
│   ├── hashtable.h
│   ├── jhash.c
│   ├── jhash.h
│   ├── libcs50-given.a
│   ├── memory.c
│   ├── memory.h
│   ├── memory.md
│   ├── set.c		# if you decided to add your Lab3 solution
│   ├── set.h
│   ├── webpage.c
│   ├── webpage.h
│   └── webpage.md
├── querier		# added by you (Lab6) except REQUIREMENTS, fuzzquery
│   ├── .gitignore
│   ├── REQUIREMENTS.md
│   ├── DESIGN.md
│   ├── IMPLEMENTATION.md
│   ├── Makefile
│   ├── README.md
│   ├── fuzzquery.c
│   ├── querier.c
│   ├── testing.sh
│   └── testing.out

You shall NOT modify any of the code we provide in libcs50.

You shall NOT commit any data files produced by the crawler, any binary/object files produced by the compiler, backup files, core dumps, etc.

Submission - how

When you are ready for final submission,

  • Add all required files using an appropriate git add commandline.
  • Commit all changes using an appropriate git commit commandline.
  • Tag: git tag lab6submit
  • Push: git push --tags to ensure the tags are pushed to the remote. See more about tags.

Use your web browser to view your remote on GitHub and make sure there is a tag lab6submit and the files associated with that tagged commit include everything we’ll need to grade your lab.

We grade the version tagged lab6submit, so it is safe for you to continue work (e.g., to start on the next lab), even if you commit or push.

We will start grading when we first see a commit tagged lab6submit, and will judge it late if that tag was added after the deadline. To avoid confusion, please email the graduate TA and the instructor if you want a late submission to be graded.

Your entire codebase must compile with make from the top-level directory and produce no compiler warnings.

We will run valgrind on your program to look for memory leaks; you will lose points for memory leaks or memory errors.

Grading

Lab 6 is scored on the basis of 100 points, with Delivery, Documentation, Style, and Testing comprising most of the points.

“Functionality” represents 40/100 points. In recognition that you might find it best to start simple and slowly enhance your solution as you get the simpler version working, you can earn points as follows:

  • 10 points if your querier prints the set of documents that contain all the words in the query; you may treat operators (‘and’, ‘or’) as regular words.
  • 20 points if your querier also supports ‘and’ and ‘or’ operators, but without precedence (in mathematical terms, it treats them as left associative, equal precedence operators).
  • 30 points if your querier also supports ‘and’ precedence over ‘or’.
  • 40 points if your querier also prints the document set in decreasing order by score, thus meeting the full specs.

Partial credit is available, of course, at the judgement of the grader, but the above is the coarse-grained rubric.

Please indicate in your querier/README.md which of the above subsets of functionality you implemented.

Hints and tips

Many of the Lab4 hints and Lab5 hints are still relevant.

Processing a query and ranking the results are tricky. We encourage you to start with a simplified case, test it thoroughly, then enhance. That approach is easier to code, test, and debug, and when facing a deadline it’s better to have a less-functional program that works than a fully functional program that doesn’t. See the section on Grading above regarding the points allocated as you achieve higher levels of functionality.

Review the lecture notes

There are some examples and design tips in the lecture notes.

Hashtable

How big should your hashtable be? You can know how many words it will store, because the index file has one word per line: count the number of lines in the index file before creating the index data structure and before loading the file into it. Then think about how the hashtable size (in slots) should relate to the number of words it will need to store.
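
For example, here is a minimal sketch, assuming libcs50’s file module (file.h), whose lines_in_file() counts the lines of an open file and rewinds it; the function name index_table_for is mine, for illustration only.

#include <stdio.h>
#include "file.h"        // libcs50: declares lines_in_file()
#include "hashtable.h"   // libcs50: declares hashtable_new()

/* Create a hashtable sized for the index file open on fp.
 * One word per line means one hashtable entry per line, so a slot
 * count near the line count keeps the chains short. */
static hashtable_t *
index_table_for(FILE *fp)
{
  int nwords = lines_in_file(fp);        // also rewinds fp to the start
  int nslots = nwords > 0 ? nwords : 1;  // guard against an empty file
  return hashtable_new(nslots);
}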

Parsing queries

We strongly recommend that your code read the entire query (a line of input) into a single string, then tokenize the query string. That is, write a function that takes a string and builds an array of words, using white space (space or tab) as the delimiter; each word can be normalized (lower-cased) and checked for illegal characters before being added to the array. See one approach at the bottom of this page.

Now that all the character-by-character parsing is behind you, and you have an array of words, you can step through the array to print a clean query, that is, with no extraneous spaces and all letters in lower case.
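
For example, assuming words[] and nwords came from the tokenizer sketched at the bottom of this page, a minimal sketch:

#include <stdio.h>

/* Print the clean query: single spaces, no trailing space.
 * The "Query:" label here is illustrative; use whatever the Specs require. */
static void
print_clean_query(char *words[], const int nwords)
{
  printf("Query:");
  for (int i = 0; i < nwords; i++) {
    printf(" %s", words[i]);   // one space before each word
  }
  printf("\n");
}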

You can then step through the array according to the structure defined in the BNF. Two tips:

  • Validate the basic structure: neither the first nor the last word may be an operator, and two operators may not be adjacent. If the query is valid, proceed to the next step; otherwise print a suitable error message (see the sketch after this list).
  • Structure your code to follow the structure of the grammar, which has two non-terminals (query and andsequence), and write a function that handles each (both are sketched at the end of the Combining results section below):
    • One function returns the set of documents that satisfy an andsequence, looping over the words in the andsequence and accumulating an answer (like a running product) as it goes.
    • Another function returns the set of documents that satisfy a query, looping over calls to the above function for each andsequence and accumulating an answer (like a running total) as it goes.
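
Here is a minimal sketch of the structural check in the first tip; isop() is my helper, not part of any spec, and the words[] array is assumed to come from the tokenizer sketched at the bottom of this page.

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Is this (already-normalized) word an operator? */
static bool
isop(const char *word)
{
  return strcmp(word, "and") == 0 || strcmp(word, "or") == 0;
}

/* Return true if the query obeys the basic structural rules. */
static bool
valid_query(char *words[], const int nwords)
{
  if (nwords == 0) {
    return false;                 // caller should simply ignore an empty query
  }
  if (isop(words[0]) || isop(words[nwords-1])) {
    fprintf(stderr, "Error: query may not begin or end with an operator\n");
    return false;
  }
  for (int i = 1; i < nwords; i++) {
    if (isop(words[i-1]) && isop(words[i])) {
      fprintf(stderr, "Error: operators may not be adjacent\n");
      return false;
    }
  }
  return true;
}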

See Lecture notes for more hints about how this might work.

Combining results

Suppose you have one counters object representing the set of documents in which a given word appears, and another counters object representing the set of documents in which another word appears; each counter set is really a set of (docID, count) pairs. How do you combine them? Recall Lecture.

If you are processing wordA AND wordB, the set of documents that match both words is the intersection of the two sets, and the score for each document (per the specs) is the minimum of the count for each document. So you need a way to intersect two counters; we recommend iterating over one set and, for each item, checking whether that document exists in the other set; update the first set according to what you find. You can use counters_iterate, counters_get, and counters_set to do this, without modifying your counters module.
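
A minimal sketch of that intersection, assuming the libcs50 counters API; the struct and function names are mine. Setting a document’s count to zero when it is missing from the other set effectively drops it from the result.

#include "counters.h"

/* Bundle the two counters so the itemfunc can see both. */
struct twocts {
  counters_t *result;  // the set being updated in place
  counters_t *other;   // the other word's set
};

static void
intersect_helper(void *arg, const int key, const int count)
{
  struct twocts *two = arg;
  int othercount = counters_get(two->other, key);  // 0 if key is absent
  // the AND score is the minimum of the two counts
  counters_set(two->result, key, count < othercount ? count : othercount);
}

/* Intersect result with other, updating result in place. */
static void
intersect(counters_t *result, counters_t *other)
{
  struct twocts args = { result, other };
  counters_iterate(result, &args, intersect_helper);
}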

If you are processing wordA OR wordB, the set of documents that match either word is the union of the two sets, and the score for each document (per the definition above) is the sum of the count for each document. So you need a way to union two counters; we recommend iterating over the second set and, for each item, checking whether that document exists in the first set; update the first set according to what you find. Again, you can use counters_iterate, counters_get, and counters_set to do this, without modifying your counters module.
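
And a matching sketch of the union, under the same assumptions; it is named unite because union is a C keyword.

/* Fold one (docID, count) pair of other into result. */
static void
unite_helper(void *arg, const int key, const int count)
{
  counters_t *result = arg;
  // the OR score is the sum; counters_get returns 0 if key is absent
  counters_set(result, key, count + counters_get(result, key));
}

/* Union other into result, updating result in place. */
static void
unite(counters_t *result, counters_t *other)
{
  counters_iterate(other, result, unite_helper);
}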

While processing an andsequence you can compute a ‘running product’, that is, a counters object that represents the intersection of everything seen so far in the sequence.

When processing a query, that is, a disjunction of andsequence results, you can similarly compute a ‘running sum’, that is, a counters object that represents the union of everything seen so far in the sequence.
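
Putting those pieces together, here is a minimal sketch of the two grammar functions described under Parsing queries above, using the intersect() and unite() sketches. Everything here is illustrative: index_t and index_find() stand in for however your index module hands back the counters for a word, the query is assumed to have already passed the structural check, and I assume (as libcs50’s counters.h documents) that the counters functions quietly ignore a NULL counters pointer, so a word that appears in no document yields an empty result.

#include <string.h>
#include "counters.h"

/* Return a new counters object scoring the documents that satisfy the
 * andsequence starting at words[*pos]; advances *pos past it. */
static counters_t *
match_andsequence(char *words[], const int nwords, int *pos, index_t *index)
{
  counters_t *product = counters_new();
  unite(product, index_find(index, words[(*pos)++]));  // copy first word's set
  while (*pos < nwords && strcmp(words[*pos], "or") != 0) {
    if (strcmp(words[*pos], "and") == 0) {
      (*pos)++;                                        // skip an explicit 'and'
    } else {
      // running product: intersect with the next word's set
      intersect(product, index_find(index, words[(*pos)++]));
    }
  }
  return product;
}

/* Return a new counters object scoring the documents that satisfy
 * the whole query: the union of its andsequence results. */
static counters_t *
match_query(char *words[], const int nwords, index_t *index)
{
  int pos = 0;
  counters_t *sum = match_andsequence(words, nwords, &pos, index);
  while (pos < nwords) {          // validation guarantees words[pos] is 'or'
    pos++;                        // skip the 'or'
    counters_t *next = match_andsequence(words, nwords, &pos, index);
    unite(sum, next);             // running total
    counters_delete(next);
  }
  return sum;
}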

Ranking results

After parsing and interpreting a query, you will likely have a counters object representing the score for each satisfying document. The counters module does not have a ‘sort’ method or a way to iterate over the items in sorted order. We suggest you use counters_iterate() in which the itemfunc inserts the items into a new array of structs, each struct representing one satisfying document (docID, score). Use counters_iterate twice - once to count the number of items in the set so you can allocate an appropriate-sized array, and again to stuff each item into the new array. If your itemfunc uses an insertion-sort approach to drop the new item into the array so the array is sorted in decreasing-score order, you end the iteration with a nicely sorted array of structs. Then you can simply loop over the array to print out the list of documents.
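
A minimal sketch of that two-pass approach, again assuming the libcs50 counters API; docscore and the helper names are mine.

#include <stdlib.h>
#include "counters.h"

struct docscore {
  int docID;
  int score;
};

struct sortarray {
  struct docscore *docs;  // the array being filled
  int n;                  // number of slots used so far
};

/* Pass 1: count the documents with a nonzero score. */
static void
count_item(void *arg, const int key, const int count)
{
  if (count > 0) {
    (*(int *)arg)++;
  }
}

/* Pass 2: insertion-sort each document into the array,
 * keeping it ordered by decreasing score. */
static void
sort_item(void *arg, const int key, const int count)
{
  if (count > 0) {
    struct sortarray *sa = arg;
    int i = sa->n++;
    while (i > 0 && sa->docs[i-1].score < count) {
      sa->docs[i] = sa->docs[i-1];   // slide lower scores right
      i--;
    }
    sa->docs[i].docID = key;
    sa->docs[i].score = count;
  }
}

To use them: iterate once with count_item to learn n, calloc an array of n struct docscore, iterate again with sort_item, print docs[0] through docs[n-1], and free the array.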

Testing your querier

Your querier reads queries from stdin, one per line. You can test it interactively, but to do thorough and repeated testing you can write a collection of little files, each of which contains one or more queries to your querier, and run commands like ./querier < testquery. You might write a short bash script to run the querier through several such test files. That script might even compare the output to known-good output, for regression testing.

We provide fuzzquery.c, a fuzz-testing program, for testing your querier. Its use is optional.

If your indexer never quite worked, never fear. You do not need a working indexer to write or test your querier. Try your querier on the output of our crawler and indexer.

Chopping a string into an array of strings

Here’s a nice trick for breaking a single string char* line into an array of strings char* words[]. In the example below, the query has two words, and there are leading spaces, separating spaces, and a trailing space, terminated by a null. Because in C a “string” is just a pointer to an array of characters terminated by a null \0 character, we can chop this single string into multiple strings by replacing two of the spaces with nulls and recording a pointer to the beginning of each word. We don’t need to allocate any memory for these new strings, or copy these strings - they stay right where they are.

[diagram: chopping a string into an array of strings]

To make this happen, think about sliding two pointers along the array, starting at line. One pointer might be char* word and you slide it over to the first non-space; the other pointer might be char* rest and you slide it from word to the first non-letter. Squash that character *rest with a null, and you’ve created a null-terminated string starting at word.

Think carefully about the edge cases, as you construct the loops and slide the pointers.
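
Here is a minimal sketch of that approach; the function name tokenize and its contract are my invention, not part of the Specs. It modifies line in place, fills words[] with pointers into it, and returns the number of words, or -1 if the query contains an illegal character. (For simplicity, this sketch assumes plain ASCII input.)

#include <ctype.h>

static int
tokenize(char *line, char *words[], const int maxwords)
{
  int nwords = 0;
  char *word = line;
  while (*word != '\0' && nwords < maxwords) {
    while (isspace(*word)) {
      word++;                    // slide 'word' to the first non-space
    }
    if (*word == '\0') {
      break;                     // only trailing spaces remained
    }
    char *rest = word;
    while (isalpha(*rest)) {
      *rest = tolower(*rest);    // normalize as 'rest' slides along
      rest++;
    }
    if (*rest != '\0' && !isspace(*rest)) {
      return -1;                 // illegal (non-letter, non-space) character
    }
    words[nwords++] = word;      // record the start of this word
    if (*rest == '\0') {
      break;                     // this word ended at the end of the line
    }
    *rest = '\0';                // squash the space: null-terminate the word
    word = rest + 1;             // resume after the squashed character
  }
  return nwords;
}

Because each word needs at least one letter followed by a space or the end of the line, an array of strlen(line)/2 + 1 pointers is always big enough for words[].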

ctype

You may find the functions isalpha() and isspace() useful; read the man page. They are declared in the ctype.h header.

Turning off the prompt

I found it useful to print a prompt for an interactive user (stdin is a “tty”, aka keyboard), but not print a prompt when the stdin is not a keyboard (it makes my test-output look nicer). I wrapped it in a little function to abstract the details:

#include <unistd.h>  // add this to your list of includes
/* The fileno() function is provided by the stdio library,
 * but is not declared by stdio.h when compiling with a strict
 * C standard (it is POSIX, not ISO C), so declare it here.
 */
int fileno(FILE *stream);

static void
prompt(void)
{
  // print a prompt iff stdin is a tty (terminal)
  if (isatty(fileno(stdin))) {
    printf("Query? ");
  }
}