CS 50 | Software Design and Implementation

The querier supports queries that combine words with the and and or operators. Let’s think about how one evaluates an expression involving operators. Consider an arithmetic analogy.

Arithmetic expressions

Consider the following arithmetic expression:

sum = a + b + c + d

Since addition is a left-associative operator, this means the same thing as

sum = (((a + b) + c) + d)

This means we can scan the expression from left to right, accumulating a sum as we go, effectively like this:

sum = 0
sum = sum + a
sum = sum + b
sum = sum + c
sum = sum + d

Here, the sum acts as an accumulator. (Indeed, many early hardware architectures include an explicit register called an ‘accumulator’.)

We often see this approach generalized in code:

int n = 5;
int array[n] = {42, 34, 12, -5, 19};
int sum = 0;
for (int i = 0; i < n; i++)
	sum += array[i];
printf("sum = %d; average = %f\n", sum, (float) sum / n);

Precedence

What if you have a mixture of operators, with precedence?

Consider the following arithmetic expression:

sum = a + b * c + d

Both addition and multiplication are left-associative operators, but multiplication takes precedence over addition. Thus, we implicitly rewrite the above expression as follows:

sum = ((a + (b * c)) + d)

or, in sequence,

sum = 0
sum = sum + a 
prod = 1
prod = prod * b
prod = prod * c
sum = sum + prod
sum = sum + d

Notice how we ‘step aside’ from the sum for a moment while we compute the product b * c … using an exactly analogous process. prod is an accumulator for the product; it is initialized to the multiplicative identity (1) instead of the additive identity (0), for reasons I hope are obvious. But then we just multiply in each of the successive items, one at a time.

This generalizes to longer expressions like

sum = a * b + c * d * e + f + g * h * i

becomes

sum = 0
prod = 1
prod = prod * a
prod = prod * b
sum = sum + prod
prod = 1
prod = prod * c
prod = prod * d
prod = prod * e
sum = sum + prod
prod = 1
prod = prod * f
sum = sum + prod
prod = 1
prod = prod * g
prod = prod * h
prod = prod * i
sum = sum + prod

Let’s add some indentation to make this a little easier to read:

sum = 0
	prod = 1
	prod = prod * a
	prod = prod * b
sum = sum + prod
	prod = 1
	prod = prod * c
	prod = prod * d
	prod = prod * e
sum = sum + prod
	prod = 1
	prod = prod * f
sum = sum + prod
	prod = 1
	prod = prod * g
	prod = prod * h
	prod = prod * i
sum = sum + prod

Notice what I did with f, and that I never add anything to sum other than prod.

This structure should give you a hint about how you might write code to evaluate such expressions… if you have a product function to scan the expression left to right from a given starting point, accumulating a product of individual items until it sees a + or the end of the expression, you can then write a function sum that scans the expression left to right from the start, accumulating a sum of products by calling product at the start and after each +.

Query expressions

Let’s map this new insight onto the TSE query syntax. That syntax combines words with two operators: and, or. Each operator is left-associative, just like multiply and add. The and operator takes precedence over the or operator, just like multiply takes precedence over add. Although the syntax allows and to be omitted, you should mentally insert an and wherever an operator is missing.

Of course, we’re not doing arithmetic; we’re querying a document index for documents that contain words. The result of a single-word query is the set of all documents which contain that word - a pointer to a counter_t counter set. In a multi-word query expression, the “value” of each word is the set of documents matching that word, and the and/or operators manipulate those sets: the and operator computes a set intersection, and the or operator computes a set union.

The same accumulator idea works here, too. For a sequence of words connected by and,

result = A and B and C       # original query
result = ((A and B) and C)   # rewritten, left-associative

result = A            # there is no "intersect identity", so just assign
result = result ^ B   # here I use ^ to represent set intersection
result = result ^ C

For a sequence of words connected by or,

result = X or Y

result = {}           # empty set, the "union identity"
result = result v X   # here I use v to represent set union
result = result v Y

And for a more complex query we need to leverage precedence:

result = X or A and B and C or Y        # original query
result = ((X or (A and B and C)) or Y)  # rewritten, per precedence

result = {}
result = result v X
	temp = A
	temp = temp ^ B
	temp = temp ^ C
result = result v temp
	temp = Y
result = result v temp	

Of course, there may be many and sequences, all within one or sequence:

result = A and B or P and Q or R and S and T or Z

result = {}
	temp = A
	temp = temp ^ B
result = result v temp
	temp = P
	temp = temp ^ Q
result = result v temp
	temp = R
	temp = temp ^ S
	temp = temp ^ T
result = result v temp
	temp = Z
result = result v temp

Notice what I did with Z, and that I never union anything to result other than temp.

Another explanation

Keep two counters_t, result, which tracks the final result, and temp, where we ‘step away’ to temporarily calculate the outcome of an AND operation. When we encounter an OR, we ‘step back’ and UNION result with temp, storing the result back in result. Remember the query cannot end with an AND or OR.

We follow these rules:

counters_t *result = NULL
counters_t *temp = NULL
Read query one word at a time
    If read a word (not AND or OR)
        find counters for this word in index (index_find(index, word))
        if temp == NULL
            temp = counters for word   //first word in AND sequence
        else
            temp = temp ^ counters for word  //intersect on AND
    else if read OR
        result = result v temp //union on OR
        temp = NULL
    else if read AND
        continue to next word  //implicit AND between words
Return result v temp //union

Notice that result never unions with anything other than temp as in the example above. Instead we ‘step away’ to calculate temp and union with result on OR (and also as the final step).

Example query: A and B or P and Q or R and S and T or Z

Read	result	temp	Notes
	NULL	NULL	initialize result, temp
A		A	temp set to counters for word
and			continue
B		A ^ B	intersect A with B
or	NULL v (A ^ B)	NULL	result = result v temp; set temp = NULL
P		P	temp set to counters for word
and			continue to next word
Q		P ^ Q	intersect P with Q
or	(A ^ B) v (P ^ Q)	NULL	result = result v temp; set temp = NULL
R		R	temp set to counters for word
and			continue to next word
S		R ^ S	intersect R and S
and			continue to next word
T		(R ^ S) ^ T	intersect (R ^ S) with T
or	(A ^ B) v (P ^ Q) v (R ^ S ^ T)	NULL	result = result v temp; set temp = NULL
Z		Z	temp set to counters for word
end	(A ^ B) v (P ^ Q) v (R ^ S ^ T) v Z	NULL	result = result v temp; set temp = NULL

Return result.