Lab 6 - Querier
Due Wednesday May 18, at 10pm
In this lab you’ll continue the Tiny Search Engine (TSE) by coding the Querier according to the Requirements Spec in the starter kit.
Grading will focus on CS50 coding style - including consistent formatting, selection of identifier names, and use of meaningful comments - in addition to correctness, testing, and documentation.
Your C code must compile without producing any compiler warnings. You will lose points if the compiler produces warnings when using our CS50-standard compiler flags.
If your submitted code fails to compile, or triggers a segmentation fault, you will fail all/some of our correctness tests. Write defensive code: each function should check its pointer parameters for NULL, and take some appropriate (safe) action. Write solid unit tests, test drivers, and use regression testing as your development proceeds.
If valgrind reports memory errors or memory leaks when querier exits normally, you will lose points.
Preparation
- Start with the same repository you used for Lab 5. Before you begin, make sure you submitted Lab 5 correctly, and pushed the tag to your GitHub remote.
- Check that your local repo is clean with `make clean`, and that everything has been committed and pushed, according to `git status`.
- Do not proceed if you have uncommitted changes or unpushed commits. Seek help if you need to sort out your repo or GitHub.
Assignment
Write the third sub-system of the Tiny Search Engine, the querier. Your design and implementation must follow the Querier Requirements Spec (aka “the Specs”), and make good use of our abstract data structures.
As you create files and directories, add them to your git repository.
Important: do not add any compiled files to git.
In short: your repo should not contain any data files, editor backup files, core dumps, or files built with `make` and removed by `make clean`.
Commit and push at least once a day.
In your `tse-xxxx` directory,

- Review the `README.md` file; add any top-level comments you need to communicate to the grader.
- Review `Makefile`; you may need to uncomment the lines for `querier`.
In the `querier` subdirectory,

- Write a file `DESIGN.md` to provide the Design Spec for querier, drawing inspiration from the lecture notes.
- Write a file `IMPLEMENTATION.md` to provide the implementation spec and testing plan for querier. See lecture notes about what should be in an implementation spec.
- Write a `README.md` file to provide any special information about how to compile or test your querier, any assumptions you made while writing the querier, and an indication of how much of the functionality you implemented, as described below.
- Write a program `querier.c` according to the Specs and your Design.
- Your program shall leverage common code from `common.a` in the `../common` directory:
  - a module `index.c` that contains all the logic for saving and loading index files (this module is common between the Indexer, Querier, and indextest);
  - a module `pagedir.c` that contains all the logic for saving pages to a crawler output directory, and for loading pages from a crawler output directory (this module is common between the Crawler and Indexer);
  - a module `word.c` that implements `NormalizeWord` (this module is common between the Indexer and Querier);
  - a `Makefile` that builds `common.a` for use by the programs;
  - a short `README.md` file explaining how to build `common.a`;
  - a `.gitignore` file telling Git to ignore the binary files (object files) in this directory.
- Write a `Makefile` that will, by default, build the `querier` executable program.
- Add a `make clean` target that removes files produced by Make or your tests.
- Add a `make test` target that tests your querier.
- Write a `testing.sh` bash script that can be run by `make test`. This script must include good comments describing your tests. For best results, `make test` should run `bash -v testing.sh`.
- Save the output of your tests with `make test &> testing.out`.
- Write a `.gitignore` file telling Git to ignore the binary files (object files) and other unnecessary files in this directory.
Your submitted repo should now include the following files, plus any others you created for testing:
├── .gitignore
├── Makefile
├── README.md
├── common # added by you (Lab4)
│ ├── .gitignore
│ ├── Makefile
│ ├── README.md
│ ├── index.c # added by you (Lab5)
│ ├── index.h # added by you (Lab5)
│ ├── pagedir.c
│ ├── pagedir.h
│ ├── word.c # added by you (Lab5)
│ └── word.h # added by you (Lab5)
├── crawler # added by you (Lab4) except REQUIREMENTS, DESIGN
│ ├── .gitignore
│ ├── REQUIREMENTS.md
│ ├── DESIGN.md
│ ├── IMPLEMENTATION.md
│ ├── Makefile
│ ├── README.md
│ ├── testing.sh
│ ├── testing.out
│ └── crawler.c
├── indexer # added by you (Lab5) except REQUIREMENTS
│ ├── .gitignore
│ ├── REQUIREMENTS.md
│ ├── DESIGN.md
│ ├── IMPLEMENTATION.md
│ ├── Makefile
│ ├── README.md
│ ├── indexer.c
│ ├── indextest.c
│ ├── testing.sh
│ └── testing.out
├── libcs50
│ ├── .gitignore
│ ├── Makefile
│ ├── README.md
│ ├── bag.c
│ ├── bag.h
│ ├── counters.c # if you decided to add your Lab3 solution
│ ├── counters.h
│ ├── file.c
│ ├── file.h
│ ├── file.md
│ ├── hashtable.c # if you decided to add your Lab3 solution
│ ├── hashtable.h
│ ├── jhash.c
│ ├── jhash.h
│ ├── libcs50-given.a
│ ├── memory.c
│ ├── memory.h
│ ├── memory.md
│ ├── set.c # if you decided to add your Lab3 solution
│ ├── set.h
│ ├── webpage.c
│ ├── webpage.h
│ └── webpage.md
├── querier # added by you (Lab6) except REQUIREMENTS, fuzzquery
│ ├── .gitignore
│ ├── REQUIREMENTS.md
│ ├── DESIGN.md
│ ├── IMPLEMENTATION.md
│ ├── Makefile
│ ├── README.md
│ ├── fuzzquery.c
│ ├── querier.c
│ ├── testing.sh
│ └── testing.out
You shall NOT modify any of the code we provide in `libcs50`.
You shall NOT commit any data files produced by the crawler, any binary/object files produced by the compiler, backup files, core dumps, etc.
Submission - how
When you are ready for final submission,

- Add all required files using an appropriate `git add` command line.
- Commit all changes using an appropriate `git commit` command line.
- Tag: `git tag lab6submit`
- Push: `git push --tags` to ensure the tags are pushed to the remote. See more about tags.

Use your web browser to view your remote on GitHub and make sure there is a tag `lab6submit` and the files associated with that tagged commit include everything we’ll need to grade your lab.
We grade the version tagged `lab6submit`, so it is safe for you to continue work (e.g., to start the next lab), even if you commit or push.
We will start grading when we first see a commit tagged `lab6submit`, and will judge it late if that tag was added after the deadline.
To avoid confusion, please email the graduate TA and the instructor if you want a late submission to be graded.
Your entire codebase must compile with `make` from the top-level directory and produce no compiler warnings.
We will run `valgrind` on your program to look for memory leaks; you will lose points for memory leaks or memory errors.
Grading
Lab 6 is scored on the basis of 100 points, with Delivery, Documentation, Style, Testing comprising most of the points.
“Functionality” represents 40/100 points. In recognition that you might find it best to start simple and slowly enhance your solution as you get the simpler version working, you can earn points as follows:
- 10 points if your querier prints the set of documents that contain all the words in the query; you may treat operators (‘and’, ‘or’) as regular words.
- 20 points if your querier also supports ‘and’ and ‘or’ operators, but without precedence (in mathematical terms, it treats them as left associative, equal precedence operators).
- 30 points if your querier also supports ‘and’ precedence over ‘or’.
- 40 points if your querier also prints the document set in decreasing order by score, thus meeting the full specs.
Partial credit is available, of course, per the judgment of the grader, but the above is the coarse-grained rubric.
Please indicate in your `querier/README.md` which of the above subsets of functionality you implemented.
Hints and tips
Many of the Lab4 hints and Lab5 hints are still relevant.
Processing a query and ranking the results are tricky. We encourage you to start with a simplified case, test it thoroughly, then enhance. A simpler version is easier to code, test, and debug, and when facing a deadline it’s better to have a less-functional program that works than a fully functional program that doesn’t. See the section on Grading above regarding the points allocated as you achieve higher levels of functionality.
Review the lecture notes
There are some examples and design tips in the lecture notes.
Hashtable
How big should your hashtable be? Well, you can know how many words it will store - because the index file has one word per line, and you can count the number of lines in the index file before creating an index data structure and before loading the file into the structure. Just think about how the hash-table size (in slots) might relate to the number of words it will need to store.
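Since the index file has one word per line, counting its lines tells you how many words the table must hold. Below is a minimal sketch of such a line counter; the function name `lines_in_file` is hypothetical (the libcs50 `file` module may already provide something similar, so check `file.h` first).

```c
#include <stdio.h>

/* Count the number of lines in a file by counting newline characters.
 * Leaves the file position at EOF, so rewind() before re-reading.
 */
static int
lines_in_file(FILE* fp)
{
  int nlines = 0;
  if (fp != NULL) {
    int c;
    while ((c = fgetc(fp)) != EOF) {
      if (c == '\n') {
        nlines++;
      }
    }
  }
  return nlines;
}
```

With the word count in hand, you might, for example, allocate on the order of that many slots (or some fraction of it), since a hashtable with chaining tolerates a modest load factor.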
Parsing queries
We strongly recommend that your code read the entire query (a line of input) into a single string, then tokenize the query string. That is, you should write a function that takes a string and builds an array of words; it should use white space (space or tab) as the delimiter; each word can be normalized (lower-cased) and checked for illegal characters before being added to the array. See one approach at bottom.
Now that all the character-by-character parsing is behind you, and you have an array of words, you can step through the array to print a clean query, that is, with no extraneous spaces and all letters in lower case.
You can then step through the array according to the structure defined in the BNF. Two tips:

- Validate the basic structure: neither the first nor last word may be an operator, and two operators may not be adjacent. If valid, proceed to the next step; otherwise print a suitable error message.
- Structure your code to follow the structure of the grammar, which has two non-terminals (`query` and `andsequence`), and write a function that handles each:
  - One function returns the set of documents that satisfy an `andsequence`, looping over the words in the `andsequence`; accumulate an answer (like a running product) as you go.
  - Another function returns the set of documents that satisfy a `query`, looping over calls to the above function for each `andsequence`; accumulate an answer (like a running total) as you go.
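The structure check in the first tip above can be sketched as follows; the names `isOperator` and `validQuery` are hypothetical, and the sketch assumes the words are already normalized to lower case.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

// Hypothetical helper: true if the word is one of the operators.
static bool
isOperator(const char* word)
{
  return strcmp(word, "and") == 0 || strcmp(word, "or") == 0;
}

/* Validate the basic query structure: no operator may appear first,
 * last, or adjacent to another operator. Prints an error message and
 * returns false on the first violation.
 */
static bool
validQuery(char* words[], int nWords)
{
  if (words == NULL || nWords == 0) {
    return false;
  }
  if (isOperator(words[0])) {
    fprintf(stderr, "Error: '%s' cannot be first\n", words[0]);
    return false;
  }
  if (isOperator(words[nWords-1])) {
    fprintf(stderr, "Error: '%s' cannot be last\n", words[nWords-1]);
    return false;
  }
  for (int i = 1; i < nWords; i++) {
    if (isOperator(words[i-1]) && isOperator(words[i])) {
      fprintf(stderr, "Error: '%s' and '%s' cannot be adjacent\n",
              words[i-1], words[i]);
      return false;
    }
  }
  return true;
}
```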
See Lecture notes for more hints about how this might work.
Combining results
Suppose you have one `counters` object representing the set of documents in which a given word appears, and another `counters` object representing the set of documents in which another word appears; each counter set is really a set of (docID, count) pairs.
How do you combine them?
Recall Lecture.
If you are processing `wordA AND wordB`, the set of documents that match both words is the intersection of the two sets, and the score for each document (per the specs) is the minimum of the count for each document.
So you need a way to intersect two `counters`; we recommend iterating over one set and, for each item, checking whether that document exists in the other set; update the first set according to what you find.
You can use `counters_iterate`, `counters_get`, and `counters_set` to do this, without modifying your `counters` module.
If you are processing `wordA OR wordB`, the set of documents that match either word is the union of the two sets, and the score for each document (per the definition above) is the sum of the count for each document.
So you need a way to union two `counters`; we recommend iterating over the second set and, for each item, checking whether that document exists in the first set; update the first set according to what you find.
Again, you can use `counters_iterate`, `counters_get`, and `counters_set` to do this, without modifying your `counters` module.
While processing an `andsequence` you can compute a ‘running product’, that is, a `counters` object that represents the intersection of everything seen so far in the sequence.
When processing a query, that is, a disjunction of `andsequence` results, you can similarly compute a ‘running sum’, that is, a `counters` object that represents the union of everything seen so far in the sequence.
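To make the min/sum logic concrete without depending on the `counters` module, here is a self-contained sketch that models a counter set as a plain array indexed by docID, where a count of 0 means "not in the set". In your real querier you would keep `counters` objects and implement the same logic with `counters_iterate`, `counters_get`, and `counters_set`; the array, the bound `MAXDOCS`, and the function names are assumptions of this sketch only.

```c
#define MAXDOCS 10   // hypothetical bound for this sketch

// AND: running product. Keep the intersection in acc, scoring each
// document with the minimum of the two counts (min with 0 drops
// documents missing from either set).
static void
intersect(int acc[MAXDOCS], const int other[MAXDOCS])
{
  for (int d = 0; d < MAXDOCS; d++) {
    acc[d] = (acc[d] < other[d]) ? acc[d] : other[d];
  }
}

// OR: running sum. Keep the union in acc, scoring each document
// with the sum of the two counts.
static void
unite(int acc[MAXDOCS], const int other[MAXDOCS])
{
  for (int d = 0; d < MAXDOCS; d++) {
    acc[d] += other[d];
  }
}
```

Note the asymmetry matches the text above: intersection shrinks the running product (a document absent from either set ends with score 0), while union only grows the running sum.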
Ranking results
After parsing and interpreting a query, you will likely have a `counters` object representing the score for each satisfying document.
The `counters` module does not have a ‘sort’ method or a way to iterate over the items in sorted order.
We suggest you use `counters_iterate()` in which the `itemfunc` inserts the items into a new array of structs, each struct representing one satisfying document (docID, score).
Use `counters_iterate` twice: once to count the number of items in the set so you can allocate an appropriately sized array, and again to stuff each item into the new array.
If your `itemfunc` uses an insertion-sort approach to drop each new item into the array so the array is sorted in decreasing-score order, you end the iteration with a nicely sorted array of structs.
Then you can simply loop over the array to print out the list of documents.
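The insertion step might look like the sketch below; the struct and function names are hypothetical, and in the real querier this logic would live in the `itemfunc` passed to `counters_iterate`, with the array and count carried through the `arg` pointer.

```c
// One satisfying document (a hypothetical struct for this sketch).
typedef struct {
  int docID;
  int score;
} doc_t;

/* Insert (docID, score) into docs[0..*nDocs-1], keeping the array
 * sorted in decreasing-score order. Caller must have allocated room
 * for at least *nDocs + 1 entries.
 */
static void
insert_sorted(doc_t docs[], int* nDocs, int docID, int score)
{
  int i = *nDocs;
  // shift lower-scoring entries right until we find our slot
  while (i > 0 && docs[i-1].score < score) {
    docs[i] = docs[i-1];
    i--;
  }
  docs[i].docID = docID;
  docs[i].score = score;
  (*nDocs)++;
}
```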
Testing your querier
Your querier reads queries from stdin, one per line.
You can test it interactively, but to do thorough and repeated testing you can write a collection of little files, each of which contains one or more queries, and run commands like `./querier < testquery`.
You might write a short bash script to run the querier through several such test files.
That script might even compare the output to known-good output, for regression testing.
We provide `fuzzquery.c`, a fuzz-testing program, for testing your querier. Its use is optional.
If your indexer never quite worked, never fear. You do not need a working indexer to write or test your querier. Try your querier on the output of our crawler and indexer.
Chopping a string into array of strings
Here’s a nice trick for breaking a single string `char* line` into an array of strings `char* words[]`.
Suppose the query has two words, with leading spaces, separating spaces, and a trailing space, terminated by a null.
Because in C a “string” is just a pointer to an array of characters terminated by a null `\0` character, we can chop this single string into multiple strings by replacing two of the spaces with nulls, and recording a pointer to the beginning of each word.
We don’t need to allocate any memory for these new strings, or copy them; they stay right where they are.
To make this happen, think about sliding two pointers along the array, starting at `line`.
One pointer might be `char* word`, which you slide over to the first non-space; the other might be `char* rest`, which you slide from `word` to the first non-letter.
Squash that character `*rest` with a null, and you’ve created a null-terminated string starting at `word`.
Think carefully about the edge cases, as you construct the loops and slide the pointers.
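One way to realize the two-pointer idea is sketched below; the function name `tokenize` is an assumption, and a real querier should additionally reject characters that are neither letters nor spaces (this sketch just treats any non-letter as a delimiter).

```c
#include <ctype.h>

/* Chop line into words in place, recording a pointer to the start of
 * each word in words[]; normalizes each word to lower case as it goes.
 * Returns the number of words found, or -1 if there are more than
 * maxWords words.
 */
static int
tokenize(char* line, char* words[], int maxWords)
{
  int nWords = 0;
  char* word = line;
  while (*word != '\0') {
    // slide 'word' over to the first non-space
    while (*word != '\0' && isspace((unsigned char)*word)) {
      word++;
    }
    if (*word == '\0') {
      break;                        // nothing but trailing spaces left
    }
    if (nWords == maxWords) {
      return -1;                    // too many words for the array
    }
    words[nWords++] = word;
    // slide 'rest' from 'word' to the first non-letter
    char* rest = word;
    while (isalpha((unsigned char)*rest)) {
      *rest = tolower((unsigned char)*rest);
      rest++;
    }
    if (*rest == '\0') {
      break;                        // word ended the line; nothing to squash
    }
    *rest = '\0';                   // squash the delimiter: word now ends here
    word = rest + 1;                // continue scanning after the null
  }
  return nWords;
}
```

Because the words point into `line`, they remain valid only as long as `line` does, and no per-word `free` is needed.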
ctype
You may find the functions `isalpha()` and `isspace()` useful; read the man pages. They are declared in the `ctype.h` header.
Turning off the prompt
I found it useful to print a prompt for an interactive user (stdin is a “tty”, aka keyboard), but not print a prompt when the stdin is not a keyboard (it makes my test-output look nicer). I wrapped it in a little function to abstract the details:
```c
#include <unistd.h>   // add this to your list of includes

/* The fileno() function is provided by stdio,
 * but for some reason not declared by stdio.h, so declare it here.
 */
int fileno(FILE* stream);

static void
prompt(void)
{
  // print a prompt iff stdin is a tty (terminal)
  if (isatty(fileno(stdin))) {
    printf("Query? ");
  }
}
```