Info retrieval

[prev] [next]

Information retrieval entails searching through some kind of data store to find things relevant to some query. It's a ubiquitous problem, from the web on down. How do these things work on the back-end? We need some way to keep track of all the information that could be retrieved, in a way that makes the retrieval relatively efficient. You've probably seen such data structures before — Sets and Maps — so we'll quickly cover their API and start building up pieces of a simple information retrieval system, motivated by searching through some files for some query words.

Sets
Maps
Search
Implementations

All the code files for today: Search.java; UniqueWords.java; UniqueWordsCounts.java; UniqueWordsPositions.java; UniqueWordsPositionsFile.java; WordCountComparator.java; shakespeare.zip.

Slides from class

Sets

A set is just a collection of things. How is that different from a list? A list has a linear order — there's a 0th object, then a 1st, .... A set has no order. Furthermore, an item can appear multiple times in a list but only once in a set (since without any order, how would multiple copies be distinguished?) So the main things we can do with a set are to add objects, and see what's in there. If you're familiar with sets from math, additional operations such as union, intersection, etc. can then be provided on top of this core functionality.

Moving to a more formal ADT, the primary operations on a Set<E> are:

boolean add(E e)
Adds the specified element to this set if it is not already present. Returns true if it is a new addition.
boolean contains(E e)
Returns true if this set contains the specified element.
boolean isEmpty()
Returns true if this set contains no elements.
Iterator<E> iterator()
Returns an iterator over the elements in this set.
boolean remove(E e)
Removes the specified element from this set if it is present.
int size()
Returns the number of elements in this set (its cardinality).

All of these methods are also part of the List interface. So why have a separate interface? One reason is semantics. This clarifies to users of a set instance that there's not really an order, and that it's a collection of unique objects. The other reason, as we'll see below, is efficiency. Note that one important aspect of Lists is missing — we can't get/set by position. There's no such thing as position in a set; it's an unordered collection. All we can do is loop over the elements in some order (for-each).

We'll discuss implementation later; let's just use one for now for quick demo.

Set<Integer> numbers = new TreeSet<Integer>();
numbers.add(0);
numbers.add(0);
numbers.add(1);
numbers.add(1);
numbers.add(2);
numbers.add(2);
numbers.add(3);
numbers.add(3);
System.out.println(numbers);       // [0, 1, 2, 3]
System.out.println(numbers.contains(3)); // true
System.out.println(numbers.contains(4)); // false

As this demo illustrates, a set makes it really easy to keep just a single copy of each unique object. So, for example, if we wanted to represent a file (such as a book or a web page) as just a collection of its words (for comparison to other files, say), we just loop over them and add them to the set. No other pain on our part. Code: UniqueWords.java.

However, we might want not just the unique words, but their numbers of occurrences — documents with similar usage frequencies are more similar. That brings us to Map.

Maps

The Map interface describes a way to associate each element (unique, as in sets) with some data. The elements act as keys, by which we can retrieve values. A key is typically something like your student ID number, and the associated data might be your student record. (The idea of a key:value pairing should sound familiar, and you can probably already imagine one implementation of the interface, which we'll come to below.)

The primary operations in a Map<K,V> are:

boolean containsKey(K key)
Returns true if this map contains a mapping for the specified key.
boolean containsValue(V value)
Returns true if this map maps one or more keys to the specified value.
V get(K key)
Returns the value to which this map maps the specified key.
boolean isEmpty()
Returns true if this map contains no key-value mappings.
Set<K> keySet()
Returns a set view of the keys contained in this map.
V put(K key, V value)
Associates the specified value with the specified key in this map. Returns the previous value associated with key, or null if key was not in the map.
V remove(K key)
Removes the mapping for this key from this map if it is present. Returns the value associated with key (or null if key is not in the map).
int size()
Returns the number of key-value mappings in this map.

So now let's extend our unique word identifier to keep track of the counts. UniqueWordsCounts.java. If we've seen the word before, increment its count; else add it to the map.

What if we want to keep a one-to-many mapping? That is, each key keeps track of multiple values? For example, we could keep track of where the various instances of a word occur. Make the value be a list or set itself. UniqueWordsPositions.java. If the word isn't in there yet, we create a new list, containing just this position, and add the list to the map. Otherwise we just stick the position into the existing list.

Finally, to set up for the next, more-complicated step, let's go ahead and load the text from a file. The Java I/O machinery is quite powerful, but for now, let's just do what we need for this, which is based on a Reader class.

private static String loadFileIntoString(String filename) throws Exception {
  BufferedReader in = new BufferedReader(new FileReader(filename));
  String str = "", line;
  while ((line = in.readLine()) != null) str += line;
  in.close();
  return str;
}

We open a new FileReader that will get its input from a text file, and then package that reader up into a BufferedReader so that we can get lines of text at a time, rather than just individual characters. The "while" loop is a bit funky. It has the usual structure of "while (condition) { do something; }", but what's with the condition? This is a canonical way of writing the condition for I/O in C-like languages, actually combining a statement "line=in.readline()" to get the next line, with a test "line != null" to see whether there was a line, or whether we reached the end of the file.

This code snippet can be packaged into the same basic class as before, yielding UniqueWordsPositionsFile.java, which works on the file text.txt (in the "inputs" folder). Note that the "loadFileIntoString" method can throw an exception, because maybe we gave it a bogus filename or something. Thus the "main" method can also throw an exception, passing the buck because we didn't try/catch.

Search

Okay, now let's do some search: Search.java. As an example of something we could search over, I grabbed a few example Shakespeare plays from Project Gutenberg and stripped them down to the public domain text: shakespeare.zip. (Unzip it and put the files in "inputs/shakespeare".) So we might want to be able to see which play talks most about "witches" or which play talks most about both "love" and "death". We'll keep this simple and avoid language processing issues, so plurals and singulars will be different words, as will different verb tenses. And forget about differences in the English language.

The core idea is to build an index, for each document, of the word counts. We already know how to do that in general; it's slightly modified here to work for each file. We associate each word count map with its filename, so we have a map from filename to (a map from word to count). We collect some useful information over the files: the total number of times each word appears, and the total number of files in which each word appears.

The command-line input lets us inspect the word counts, by entering "# n" to list the top n words over all files, "# -n" for the bottom n words. Or we can restrict to a file by giving its filename (include the ".txt" part), "# n filename.txt" or "# -n filename.txt". The method gets a list of the entries from the map, i.e., the (word,count) pairs. It then uses Collections.sort to sort the list, with a custom comparator. This works much like having a class implement Comparable, but for a class that already exists. We provide an auxiliary class to do the work on behalf of the original class. This auxiliary class provides the compare() method by which entries are sorted — first compare the counts, and then the words themselves to break ties.

To search() for a given word, we simply check each file's map to see if it's there. (On your own, see if you can produce the results in sorted order.)

A set of words is treated here as an "AND" (it's not a phrase). So we want to find only the files that have all the words. This can be done by set intersection (which is the "AND" of sets) — intersect the files that have the first word with those that have the second, etc. (An efficiency hack would only consider the files with the first word when looking up the second word, etc.)

How would we rank the files that come back with multiple words? One approach would be just to sum up the frequencies. So a document with "love" appearing 151 times and "death" appearing 74 times would have a score of 225. The problem with this is that some words are just more common than others. Searches for "my cousin" would be skewed by the fact that "my" is very common, and the number of "my"s in a document would drown out the number of "cousin"s, obscuring which file was really more relevant. One simple way to handle this is the "tf-idf" model (term frequency - inverse document frequency). The "tf" part is exactly what we've been doing (though there are variations), counting the number of occurrences of the term. The "df" part weights each term frequency by a notion of how common it is. Once such weight is the log of the fraction (# documents total / # documents with the word). So if the word appears in all documents, its weight is just one. But if appears in only one of N documents, its weight is log N, a fraction greater than 1.

The driver separates out the search for one word vs. the search for a set of words, and in the latter case prints the tf-idf score as well as the components that make it up. It uses the console to get input, rather than a GUI. A Scanner for System.in lets us read the next line from the console. Ex:

Scanner in = new Scanner(System.in);
while (true) {
  System.out.print(">");
  String line = in.nextLine();
  System.out.println(line + " to you, too");
}

Implementations

Now that we've used Map and Set, let's think a bit about implementing them.

For Map, you're hopefully already thinking of a binary search tree as a natural way to implement the interface. And you're right. But note that the interface also has a method containsValue, which our original BST didn't support. One possibility: when called, go throughout the tree, and see if any node has that value. Linear time. Another: maintain the values in some kind of data structure, updating with each put. Note that values are not necessarily unique. Set gets rid of the redundancy and lets us quickly answer the query, but when we replace a value, we don't know whether or not to remove it from the set (maybe some other key still has it, or maybe not). So actually need to keep the counts, via another map. Then we have a log time test.

Another, simple implementation is as a list. The elements of the list are (key,value) pairs. The list could be in sorted order for quicker lookup but slower insertion, or in unsorted order for quicker insertion but slower lookup.

How about set? Again, a list is easy, with unsorted faster for insertion and sorted faster for lookup. Can we do something logarithmic? Just store the items as keys in a BST. Don't worry about the values — all we need is whether or not the item is in the set.

We'll see in another class a different way to implement these, using hash tables.