CS 50 | Software Design and Implementation

Goals

to introduce the TSE Querier
to learn another form of testing: fuzz testing
to learn about expressions and operator precedence

Activity

Today activity about the design of the querier.

The Querier

The third component of the Tiny Search Engine is the Querier, which reads the index produced by the Indexer and the page files produced by the Crawler, to interactively answer written queries entered by the user.

Our Querier loads the index into memory (a data structure we developed for the Indexer) and then prompts the user for queries. Queries are comprised of words, with optional and/or operators. For example,

computer science
computer and science
computer or science
baseball or basketball or ultimate frisbee

The first two examples are treated identically, matching only documents that have both words - not necessarily together (as in the phrase “computer science”). The third picks up documents that have either word. The fourth matches documents that mention baseball, or basketball, or both “ultimate” and the word “frisbee” (not necessarily together).

Here’s an example run with the provided crawled pages and index file with the letters seed URL at depth 6. The TSE output is also available at the plank server under /thayerfs/courses/22fall/cosc050/workspace/tse/tse-output.

$ loc=/thayerfs/courses/22fall/cosc050/workspace/tse/tse-output/
$ ./querier $loc/letters-depth-6 $loc/letters-index-6 
Query? first and search
Query: first and search 
Matches 2 documents (ranked):
score   1 doc   3: http://cs50tse.cs.dartmouth.edu/tse/letters/B.html
score   1 doc   8: http://cs50tse.cs.dartmouth.edu/tse/letters/D.html
-----------------------------------------------
Query? tiny search engine
Query: tiny search engine 
No documents match.
-----------------------------------------------
Query? NOTE we LOWERcase the query
Query: note we lowercase the query 
No documents match.
-----------------------------------------------
Query? spaces      do    not    matter
Query: spaces do not matter 
No documents match.
-----------------------------------------------
Query? non-letter characters are disallowed
Error: bad character '-' in query.
Query? even digits as in cs50
Error: bad character '5' in query.
Query? and
Query: and 
Error: 'and' cannot be first
Query? or
Query: or 
Error: 'or' cannot be first
Query? what about and
Query: what about and 
Error: 'and' cannot be last
Query? what about or
Query: what about or 
Error: 'or' cannot be last
Query? ^D

Let’s study the Requirements Spec (located under querier/ directory of your tse repo) for the Querier, and run some demos.

Fuzz Testing

In a recent lecture we talked about unit testing, and the difference between glass-box testing and black-box testing. Usually, these tests are based on a carefully constructed series of test cases, devised to test all code sequences and push on the “edge cases”.

However, such tests are only as good as the test writer - who must logically study the code (for glass-box testing) or the specs (for black-box testing) to think of the suitable test cases. It’s possible they will miss some important cases.

Another solution, therefore, is fuzz testing, a form of black-box testing in which you fire thousands of random inputs at the program to see how it reacts. The chances of triggering an unconsidered test case is far greater if you try a lot of cases!

Here is a fuzz-testing program for our querier. It generates a series of random queries on stdout, which it then pipes to the querier on stdin. Here’s the core of the fuzz tester:

/**************** generate_query ****************/
/* generate one random query and print to stdout.
 * pull random words from the wordlist and from the dictionary.
 */
static void
generate_query(const wordlist_t *wordlist, const wordlist_t *dictionary)
{
  // some parameters that affect query generation
  const int max_words = 6;        // generate 1..max_words
  const float or_probability = 0.3;   // P(OR between two words)
  const float and_probability = 0.2;  // P(AND between two words)
  const float dict_probability = 0.2; // P(draw from dict instead of wordlist)

  int qwords = random() % max_words + 1; // number of words in query
  for (int qw = 0; qw < qwords; qw++) {
    // draw a word either dictionary or wordlist
    if ((random() % 100) < (dict_probability * 100)) {
      printf("%s ", dictionary->words[random() % dictionary->nwords]);
    } else {
      printf("%s ", wordlist->words[random() % wordlist->nwords]);
    }

    // last word?
    if (qw < qwords-1) {
      // which operator to print?
      int op = random() % 100;
      if (op < (and_probability * 100)) {
        printf("AND ");
      }
      else if (op < (and_probability * 100 + or_probability * 100)) {
        printf("OR ");
      }
    }
  }
  printf("\n");
}

Here’s the output of 10 random queries:

$ ./fuzzquery $loc/letters-index-6 10 0
./fuzzquery: generating 10 queries from 22 words
fourier AND traversal 
this OR the the OR tse computational 
biology playground OR computational 
answers breadth search OR computational OR Mississippians OR fast 
algorithm OR coding eniac the AND home OR breadth 
traversal computational playground coding OR the 
fast 
search the OR fast 
home 
transform OR huffman OR depth AND graph AND transform 

And here’s what happens when we pipe it to our querier:

$ ./fuzzquery $loc/letters-index-6 10 0 | ./querier $loc/letters-depth-6 $loc/letters-index-6
./fuzzquery: generating 10 queries from 22 words
Query: fourier and traversal 
No documents match.
-----------------------------------------------
Query: this or the the or tse computational 
Matches 1 documents (ranked):
score   2 doc   1: http://cs50tse.cs.dartmouth.edu/tse/letters/index.html
-----------------------------------------------
Query: biology playground or computational 
Matches 1 documents (ranked):
score   1 doc   9: http://cs50tse.cs.dartmouth.edu/tse/letters/C.html
-----------------------------------------------
Query: answers breadth search or computational or mississippians or fast 
Matches 2 documents (ranked):
score   1 doc   9: http://cs50tse.cs.dartmouth.edu/tse/letters/C.html
score   1 doc   7: http://cs50tse.cs.dartmouth.edu/tse/letters/F.html
-----------------------------------------------
Query: algorithm or coding eniac the and home or breadth 
Matches 2 documents (ranked):
score   1 doc   2: http://cs50tse.cs.dartmouth.edu/tse/letters/A.html
score   1 doc   3: http://cs50tse.cs.dartmouth.edu/tse/letters/B.html
-----------------------------------------------
Query: traversal computational playground coding or the 
Matches 1 documents (ranked):
score   1 doc   1: http://cs50tse.cs.dartmouth.edu/tse/letters/index.html
-----------------------------------------------
Query: fast 
Matches 1 documents (ranked):
score   1 doc   7: http://cs50tse.cs.dartmouth.edu/tse/letters/F.html
-----------------------------------------------
Query: search the or fast 
Matches 1 documents (ranked):
score   1 doc   7: http://cs50tse.cs.dartmouth.edu/tse/letters/F.html
-----------------------------------------------
Query: home 
Matches 9 documents (ranked):
score   2 doc   1: http://cs50tse.cs.dartmouth.edu/tse/letters/index.html
score   1 doc   2: http://cs50tse.cs.dartmouth.edu/tse/letters/A.html
score   1 doc   3: http://cs50tse.cs.dartmouth.edu/tse/letters/B.html
score   1 doc   4: http://cs50tse.cs.dartmouth.edu/tse/letters/E.html
score   1 doc   5: http://cs50tse.cs.dartmouth.edu/tse/letters/G.html
score   1 doc   6: http://cs50tse.cs.dartmouth.edu/tse/letters/H.html
score   1 doc   7: http://cs50tse.cs.dartmouth.edu/tse/letters/F.html
score   1 doc   8: http://cs50tse.cs.dartmouth.edu/tse/letters/D.html
score   1 doc   9: http://cs50tse.cs.dartmouth.edu/tse/letters/C.html
-----------------------------------------------
Query: transform or huffman or depth and graph and transform 
Matches 2 documents (ranked):
score   1 doc   7: http://cs50tse.cs.dartmouth.edu/tse/letters/F.html
score   1 doc   6: http://cs50tse.cs.dartmouth.edu/tse/letters/H.html

We could generate a different series of random queries by changing the random seed, and we can run a lot more queries, too!

The fuzz tester does not test all aspects of the querier; in particular, it will not generate syntactically incorrect inputs. Those should be tested by another program, perhaps another fuzz tester. Furthermore, it does not verify whether the querier actually produces the right answers!

For regression testing, we might save the querier output in a file, and then compare the output of a fresh test run against the saved results from earlier runs. If we had earlier believed those results to be correct, then seeing unchanged output would presumably indicate the results (and thus the new code) are still correct.