CS 10: Spring 2014

Lecture 14, April 24

Heaps, continued

Last time, we started looking at the heap data structure to implement a priority queue. Let's see how to implement one. An implementation of a min-heap in an ArrayList is in HeapMinPriorityQueue.java. It shows how to implement the operations we described last time. Switching between max-heaps and min-heaps shouldn't throw you: the only difference is the direction of the comparisons.

Let's look at the worst-case running times of the min-priority queue operations in this implementation. We express them in terms of the number n of elements that are in the min-priority queue when the operations occur.

isEmpty just returns a boolean indicating whether the size of the ArrayList is zero. This method takes constant time, or Θ(1).
insert first adds a new reference at the end of the ArrayList, which takes Θ(1) amortized time. It then has to bubble the value up the heap until it is no smaller than its parent or it reaches the root. Each swap takes constant time, and the number of swaps is bounded by the height of the heap. Thus, insert takes O(lg n) time.
minimum just returns what is in position 0 of the ArrayList, taking Θ(1) time.
extractMin returns the element in position 0 and puts the last element in its place, taking Θ(1) time. It then has to restore the heap property, however, and so it has to bubble the new root down until it is no larger than either of its children or is a leaf. Like insert, this procedure takes O(lg n) time.
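The four operations above can be sketched as follows. This is a minimal illustration, not the code in HeapMinPriorityQueue.java: the class name MinHeapSketch and the helpers bubbleDown, parent, and swap are names made up here, and minimum and extractMin simply assume the queue is not empty.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a min-priority queue kept as a heap in an ArrayList. */
class MinHeapSketch<E extends Comparable<E>> {
    private final List<E> heap = new ArrayList<>();

    public boolean isEmpty() {
        return heap.isEmpty();                 // constant time
    }

    public E minimum() {
        return heap.get(0);                    // constant time; assumes a nonempty queue
    }

    public void insert(E element) {
        heap.add(element);                     // amortized constant time: new last leaf
        int i = heap.size() - 1;
        // Bubble up while the element is smaller than its parent.
        while (i > 0 && heap.get(i).compareTo(heap.get(parent(i))) < 0) {
            swap(i, parent(i));
            i = parent(i);
        }
    }

    public E extractMin() {
        E min = heap.get(0);                   // assumes a nonempty queue
        E last = heap.remove(heap.size() - 1); // removing the last slot is constant time
        if (!heap.isEmpty()) {
            heap.set(0, last);                 // the last leaf becomes the new root
            bubbleDown(0);
        }
        return min;
    }

    // Move the element at index i down until neither child is smaller.
    private void bubbleDown(int i) {
        while (true) {
            int left = 2 * i + 1;
            int right = 2 * i + 2;
            int smallest = i;
            if (left < heap.size() && heap.get(left).compareTo(heap.get(smallest)) < 0) {
                smallest = left;
            }
            if (right < heap.size() && heap.get(right).compareTo(heap.get(smallest)) < 0) {
                smallest = right;
            }
            if (smallest == i) {
                return;
            }
            swap(i, smallest);
            i = smallest;
        }
    }

    private static int parent(int i) {
        return (i - 1) / 2;
    }

    private void swap(int i, int j) {
        E temp = heap.get(i);
        heap.set(i, heap.get(j));
        heap.set(j, temp);
    }
}
```

Both loops move one level per iteration, which is where the O(lg n) bounds for insert and extractMin come from.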

Heapsort

We can also use a heap as the basis of a sorting algorithm called heapsort. Its running time is O(n lg n), and it sorts in place. That is, it needs no additional space for copying values (as merge sort does) or for a stack of recursive calls (as needed in quicksort and merge sort).

Implementing heapsort

Heapsort has two major phases. You can see all the steps in this PowerPoint presentation. First, given an array of values in an unknown order, we have to rearrange the values to obey the max-heap property. That is, we have to build a heap. Then, once we've built the heap, we repeatedly pick out the maximum value in the heap—which we know is at the root—swap it with the last leaf in the heap, and restore the max-heap property. When we put the maximum value into the array position that had held the last leaf, we consider that array position to no longer be part of the heap.

The code for heapsort is in Heapsort.java. We've written it to sort an array, rather than an ArrayList, but you can easily modify it to sort an ArrayList. Or you can use the overloaded version that takes an ArrayList, converts it to an array, sorts the array, and then copies the sorted array back into the ArrayList. At the bottom, you can see some private methods that help out other methods in the class: swap, leftChild, and rightChild.

How to build a heap

The obvious way to build a heap is to start with an unordered array. The first element is a valid heap. We can then insert the second element into the heap, then the third, etc. After we have inserted the last element, we have a valid heap. This idea works fine and leads to an O(n lg n)-time heapsort. We can avoid implementing the insert code and speed up the algorithm a bit, however, by building the heap from the bottom up rather than from the top down and using the same idea as when we restore the max-heap property during the extractMax operation.

The code to restore the max-heap property is in the maxHeapify method. It takes three parameters: the array a holding the heap and indices i and lastLeaf into the array. The maxHeapify method assumes that, when it is called, if you look at the subarray a[i..lastLeaf] (the subarray starting at index i and going through index lastLeaf), the max-heap property holds everywhere in this subarray, except possibly among node i and its children. maxHeapify restores the max-heap property everywhere in the subarray.

maxHeapify works as follows. It computes the indices left and right of the left and right children of node i, if it has such children. Node i has a left child if the index left is no greater than the index lastLeaf of the last leaf in the entire heap, and similarly for the right child.

maxHeapify then determines which node, out of node i and its children, has the greatest key value, storing the index of this node in the variable largest. First, if there's a left child, then whichever of node i and its left child has the larger value is stored in largest. Then, if there's a right child, whichever of the winner of the previous comparison and the right child has the larger value is stored in largest.

Once largest indexes the node with the largest value among node i and its children, we check to see whether we need to do anything. If largest equals i, then the max-heap property already is satisfied, and we're done. Otherwise, we swap the values in node i and node largest. By swapping, however, we have put a new, smaller value into node largest, which means that the max-heap property might be violated among node largest and its children. We call maxHeapify recursively, with largest taking on the role of i, to correct this possible violation.
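Here is a minimal sketch of the procedure just described, assuming an int array for brevity. The helper names swap, leftChild, and rightChild are the ones mentioned above, but their bodies here are assumptions, and the real Heapsort.java may differ in its details.

```java
/** Hypothetical sketch of maxHeapify on an int array. */
class MaxHeapifySketch {
    // Restore the max-heap property in a[i..lastLeaf], assuming it can be
    // violated only among node i and its children.
    static void maxHeapify(int[] a, int i, int lastLeaf) {
        int left = leftChild(i);
        int right = rightChild(i);
        int largest = i;

        // A child exists only if its index is no greater than lastLeaf.
        if (left <= lastLeaf && a[left] > a[largest]) {
            largest = left;
        }
        if (right <= lastLeaf && a[right] > a[largest]) {
            largest = right;
        }

        if (largest != i) {
            swap(a, i, largest);
            // The smaller value now sits in node largest; fix the subtree below it.
            maxHeapify(a, largest, lastLeaf);
        }
    }

    static int leftChild(int i) {
        return 2 * i + 1;
    }

    static int rightChild(int i) {
        return 2 * i + 2;
    }

    static void swap(int[] a, int i, int j) {
        int temp = a[i];
        a[i] = a[j];
        a[j] = temp;
    }
}
```

For example, calling maxHeapify on index 0 of {1, 9, 8, 4, 5, 6, 7} swaps 1 with 9 and then with 5, leaving {9, 5, 8, 4, 1, 6, 7}.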

Notice that in each recursive call of maxHeapify, the value taken on by i is one level further down in the heap. The total number of recursive calls we can make, therefore, is at most the height of the heap, which is Θ(lg n). Because we might not go all the way down to a leaf (remember that we stop once we find a node that does not violate the max-heap property), the total number of recursive calls of maxHeapify is O(lg n). Each call of maxHeapify takes constant time, not counting the time for the recursive calls. The total time for a call of maxHeapify, therefore, is O(lg n).

Now that we know how to correct a single violation of the max-heap property, we can build the entire heap from the bottom up. Suppose we were to call maxHeapify on each leaf. Nothing would change, because the only way that maxHeapify changes anything is when there's a violation of the max-heap property among a node and its children. Now suppose we called maxHeapify on each node that has at least one child that's a leaf. Then afterward, the max-heap property would hold at each of these nodes. But it might not hold at the parents of these nodes. So we can call maxHeapify on the parents of the nodes that we just fixed up, and then on the parents of these nodes, and so on, up to the root.

That's exactly how the buildMaxHeap method in Heapsort.java works. It computes the index lastNonLeaf of the highest-indexed non-leaf node, and then runs maxHeapify on nodes by decreasing index, all the way up to the root.

You can see how buildMaxHeap works on our example heap, including all the changes made by maxHeapify, by running the slide show in the PowerPoint presentation. Run it for 17 transitions, until you see the message "Heap is built."

Let's analyze how long it takes to build a heap. We run maxHeapify on at most half of the nodes, or at most n/2 nodes. We have already established that each call of maxHeapify takes O(lg n) time. The total time to build a heap, therefore, is O(n lg n).

Because we are shooting for a sorting algorithm that takes O(n lg n) time, we can be content with the analysis that says it takes O(n lg n) time to build a heap. It turns out, however, that a more rigorous analysis shows that the total time to run the buildMaxHeap method is only O(n). Notice that most of the calls of maxHeapify made by buildMaxHeap are on nodes close to a leaf. In fact, about half of the nodes are leaves and take no time, a quarter of the nodes are parents of leaves and require at most 1 swap, an eighth of the nodes are parents of the parents of leaves and take at most 2 swaps, and so on. If we sum the total number of swaps, it ends up being O(n).
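The swap-counting argument can be written as a sum. The sketch below assumes the standard fact that an n-node heap has at most ⌈n/2^(h+1)⌉ nodes of height h (leaves have height 0), which is the precise form of "half", "a quarter", "an eighth", and so on; a call of maxHeapify on a node of height h makes at most h swaps.

```latex
\begin{align*}
\text{total swaps}
  &\le \sum_{h=0}^{\lfloor \lg n \rfloor} \left\lceil \frac{n}{2^{h+1}} \right\rceil h
   \le \sum_{h=0}^{\lfloor \lg n \rfloor} \left( \frac{n}{2^{h+1}} + 1 \right) h \\
  &\le n \sum_{h=0}^{\infty} \frac{h}{2^{h+1}} + \sum_{h=0}^{\lfloor \lg n \rfloor} h
   = n \cdot 1 + O(\lg^2 n) = O(n).
\end{align*}
```

The key step is that the infinite series of h/2^(h+1) converges to 1, so the extra work at the few tall nodes is outweighed by how rare they are.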

Sorting once the heap has been built

The second phase of sorting is the while-loop in the heapsort method in Heapsort.java. After heapsort calls buildMaxHeap so that the array obeys the max-heap property, the while-loop sorts the array. You can see how it works on the example by running the rest of the slide show in the PowerPoint presentation.

Let's think about the array once the heap has been built. We know that the largest value is in the root, node 0. And we know that the largest value should go into the position currently occupied by the last leaf in the heap. So we swap these two values, and declare that the last position—where we just put the largest value—is no longer in the heap. That is, the heap occupies the first n − 1 slots of the array, not the first n. The local variable lastLeaf indexes the last leaf, and so we decrement it. By swapping a different value into the root, we might have caused a violation of the max-heap property at the root. Fortunately, we haven't touched any other nodes, and so we can call maxHeapify on the root to restore the max-heap property.

We now have a heap with n − 1 nodes. The nth slot of the array, a[n-1], contains the largest element from the original array, and this slot is no longer in the heap. So we can now do the same thing, but now with the last leaf in a[n-2]. Afterward, the second-largest element is in a[n-2], this slot is no longer in the heap, and we have run maxHeapify on the root to restore the max-heap property. We continue in this way until the only node left in the heap is node 0, the root. By then, it must contain the smallest value, and we can just declare that we're done. (This idea is analogous to how we finish up selection sort, where we put the n − 1 smallest values into the first n − 1 slots of the array. We then declared that we were done, since the only remaining value must be the largest, and it's already in its correct place.)

Analyzing this second phase is easy. The while-loop runs n − 1 times (once for each node other than node 0). In each iteration, swapping node values and decrementing lastLeaf take constant time. Each call of maxHeapify takes O(lg n) time, for a total of O(n lg n) time. Adding in the O(n lg n) time to build the heap gives a total sorting time of O(n lg n).
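Putting the two phases together gives something like the sketch below. It assumes an int array and is not the code in Heapsort.java; in particular, maxHeapify is written here as a loop rather than recursively, which is a variation chosen for this sketch so that no recursion stack is needed.

```java
import java.util.Arrays;

/** Hypothetical sketch of the two phases of heapsort on an int array. */
class HeapsortSketch {
    static void heapsort(int[] a) {
        buildMaxHeap(a);                        // phase 1: make a[0..n-1] a max-heap
        int lastLeaf = a.length - 1;
        while (lastLeaf > 0) {                  // phase 2: runs n - 1 times
            swap(a, 0, lastLeaf);               // the maximum goes to its final slot
            lastLeaf--;                         // that slot is no longer in the heap
            maxHeapify(a, 0, lastLeaf);         // only the root can be in violation
        }
    }

    static void buildMaxHeap(int[] a) {
        int lastNonLeaf = (a.length - 2) / 2;   // parent of the last leaf
        for (int i = lastNonLeaf; i >= 0; i--) {
            maxHeapify(a, i, a.length - 1);
        }
    }

    // Loop version of maxHeapify: push the value at index i down the heap.
    static void maxHeapify(int[] a, int i, int lastLeaf) {
        while (true) {
            int left = 2 * i + 1;
            int right = 2 * i + 2;
            int largest = i;
            if (left <= lastLeaf && a[left] > a[largest]) {
                largest = left;
            }
            if (right <= lastLeaf && a[right] > a[largest]) {
                largest = right;
            }
            if (largest == i) {
                return;
            }
            swap(a, i, largest);
            i = largest;
        }
    }

    static void swap(int[] a, int i, int j) {
        int temp = a[i];
        a[i] = a[j];
        a[j] = temp;
    }

    public static void main(String[] args) {
        int[] a = {5, 2, 9, 1, 7};
        heapsort(a);
        System.out.println(Arrays.toString(a));  // prints [1, 2, 5, 7, 9]
    }
}
```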

Collections

Java has a group of interfaces for holding collections of objects and classes that implement them. We have briefly touched on List, which is an interface with two Java-provided implementations: ArrayList and LinkedList. Today we will look at two other interfaces for holding collections of objects: Set and Map. Each has two Java-provided implementations. Set is implemented by HashSet and TreeSet. Map is implemented by HashMap and TreeMap. We will be looking at their underlying data structures, hash tables and binary search trees, in the next few lectures.

The `List` interface

Among other methods, the List<E> interface provides the following:

boolean add(E o)
Adds the specified element to the end of the list. Always returns true.
void add(int index, E o)
Inserts the specified element at position index of this list.
void clear()
Removes all of the elements from this list.
boolean contains(Object o)
Returns true if this list contains the specified element, false otherwise.
E get(int index)
Returns the element at position index.
boolean isEmpty()
Returns true if this list contains no elements, false otherwise.
int indexOf(Object o)
Returns the index of the first occurrence of the specified element in the list, or −1 if the list does not contain the element.
Iterator<E> iterator()
Returns an iterator that goes through the elements in this list in the order that they appear in the list.
ListIterator<E> listIterator()
Returns a list iterator that goes through the elements in this list in the order that they appear in the list.
E remove(int index)
Removes the element at the specified position in this list and returns it.
boolean remove(Object o)
Removes the first occurrence of the specified element from this list if it is present. Returns true if the element is present, false otherwise.
E set(int index, E element)
Replaces the element at the specified position with the given element. Returns the old element.
int size()
Returns the number of elements in this list.

If both ArrayList and LinkedList implement this set of operations, why have both? Efficiencies differ. Access operations (set and get) take constant time in an ArrayList, but require time proportional to the distance to the nearest end for a LinkedList. (The LinkedList is a doubly linked list, and it's smart enough to start at the closest end.) On the other hand, modification operations (add and remove at a given index) require time proportional to the number of elements after the index in an ArrayList, because all of these elements must be copied. But for a LinkedList, they take constant time after the time to access the index (distance from nearest end). Therefore, all Iterator or ListIterator operations take constant time for a LinkedList, but add and remove operations through an iterator take time proportional to the number of remaining elements for an ArrayList.
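As a small illustration of the last point, the sketch below removes elements through an Iterator. The class and method names are made up for this example. The code is the same for both implementations; only the cost of each remove differs.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;

/** Hypothetical sketch: the same iterator code on both List implementations. */
class ListChoiceSketch {
    // Removes every even number through the iterator. On a LinkedList each
    // remove takes constant time; on an ArrayList each remove shifts all of
    // the later elements down by one position.
    static void removeEvens(List<Integer> list) {
        Iterator<Integer> iter = list.iterator();
        while (iter.hasNext()) {
            if (iter.next() % 2 == 0) {
                iter.remove();
            }
        }
    }

    public static void main(String[] args) {
        List<Integer> linked = new LinkedList<>();
        List<Integer> array = new ArrayList<>();
        for (int i = 1; i <= 10; i++) {
            linked.add(i);
            array.add(i);
        }
        removeEvens(linked);
        removeEvens(array);
        System.out.println(linked);  // prints [1, 3, 5, 7, 9]
        System.out.println(array);   // prints [1, 3, 5, 7, 9]
    }
}
```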

The `Set` interface

A Set differs from a List in that a List has a linear order, whereas a Set does not. Furthermore, an element can appear multiple times in a List but only once in a Set.

Here are the primary operations on a Set<E>:

boolean add(E o)
Adds the specified element to this set if it is not already present. Returns true if o was not in the set, false if o was already in the set.
void clear()
Removes all of the elements from this set.
boolean contains(Object o)
Returns true if this set contains the specified element, false otherwise.
boolean isEmpty()
Returns true if this set contains no elements, false otherwise.
Iterator<E> iterator()
Returns an iterator over the elements in this set.
boolean remove(Object o)
Removes the specified element from this set if it is present. Returns true if o was present, false otherwise.
int size()
Returns the number of elements in this set (i.e., its cardinality).

All of these methods are also part of the List interface. So why have a separate interface?

The main reason is implementation efficiency. The contains operation on either an ArrayList or a LinkedList with n elements takes O(n) time, and for an ArrayList the remove operation can take O(n) time. For applications such as a dictionary for a spell checker, these running times are too slow.

There are two implementations of Set in the Java class library. Both implement the contains operation more efficiently than it can be implemented for a List.

The first implementation is TreeSet, which uses a data structure called a balanced binary search tree to store the data. You can think of it as a little like a linked list on which you can do binary search. We will talk about this data structure soon. The important point is that the add, remove, and contains methods all take O(lg n) time for a set with n elements. It works only on Comparable objects (unless you supply a Comparator when you construct the set). The iterator is guaranteed to return the elements in increasing order by compareTo and takes O(n) time to iterate through the entire set. Getting the first element from the iterator takes O(lg n) time.

The second is HashSet, which uses a data structure called a hash table. We will talk about hash tables next time.

If the hash table is used properly, then the add, remove, and contains operations all take O(1) time on average (although it is possible that they could take Θ(n) time if you were extremely unlucky). The iterator returns the elements in a somewhat arbitrary order.

As an example of the use of sets, consider the program SetDemo.java. It creates a set consisting of all of the keywords in Java. It then uses an iterator to go through the set and print each of the words. (Note that an iterator on a Set is identical to an iterator on a List.) Finally, it lets the user type words and determines if they are keywords by using contains to see if they are in the set.
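The sketch below is a cut-down illustration in the same spirit, not the code in SetDemo.java. It uses only a handful of keywords (with one deliberate duplicate), and the class name, the SAMPLE array, and the fill method are made up here.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch of filling and querying a Set of keywords. */
class KeywordSetSketch {
    // A few Java keywords; "class" appears twice on purpose.
    static final String[] SAMPLE = {"while", "class", "int", "for", "class"};

    // Adds each sample word to the set and counts the adds that were rejected.
    static int fill(Set<String> set) {
        int rejected = 0;
        for (String word : SAMPLE) {
            if (!set.add(word)) {   // add returns false if the word is already present
                rejected++;
            }
        }
        return rejected;
    }

    public static void main(String[] args) {
        Set<String> hashed = new HashSet<>();
        Set<String> sorted = new TreeSet<>();
        fill(hashed);
        fill(sorted);
        System.out.println(hashed);                  // some arbitrary order
        System.out.println(sorted);                  // prints [class, for, int, while]
        System.out.println(sorted.contains("for"));  // prints true
        System.out.println(sorted.contains("foo"));  // prints false
    }
}
```

Because both classes implement Set, swapping HashSet for TreeSet changes only the line that constructs the set.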

The `Map` interface

The Map interface describes a data structure that can be thought of as a set where each element has associated data. Each data element is associated with a key. By looking up the key, you can get the associated data, just like a dictionary in Python. A key is typically something like your student ID number, and the associated data might be your student record. A Map can be implemented using a balanced binary tree or a hash table, just like a Set.

Where K is the generic type for the key and V is the generic type for the associated data, the primary operations in a Map<K,V> are the following:

void clear()
Removes all mappings from this map.
boolean containsValue(Object value)
Returns true if this map maps one or more keys to the specified value, false otherwise.
V get(Object key)
Returns the value to which this map maps the specified key.
boolean isEmpty()
Returns true if this map contains no key-value mappings.
Set<K> keySet()
Returns a Set containing the keys contained in this map.
V put(K key, V value)
Associates the specified value with the specified key in this map. Returns the previous value associated with key, or null if key was not in the map.
V remove(Object key)
Removes the mapping for this key from this map if it is present. Returns the value associated with key (or null if key is not in the map).
int size()
Returns the number of key-value mappings in this map.
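A short sketch of how the return values of put, get, and remove behave, using a HashMap from animal names to sounds. The class name and the describe helper are made up for this illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of the return values of the main Map operations. */
class AnimalMapSketch {
    // Builds a sentence for the animal, with a fallback for when get returns null.
    static String describe(Map<String, String> sounds, String animal) {
        String sound = sounds.get(animal);       // null if animal is not a key
        if (sound == null) {
            return "No sound recorded for " + animal;
        }
        return animal + " says " + sound;
    }

    public static void main(String[] args) {
        Map<String, String> sounds = new HashMap<>();
        String before = sounds.put("cow", "moo");    // null: "cow" was not yet a key
        String replaced = sounds.put("cow", "MOO");  // "moo": the previous value
        sounds.put("duck", "quack");
        String removed = sounds.remove("duck");      // "quack": the removed value
        System.out.println(before + " " + replaced + " " + removed);  // prints null moo quack
        System.out.println(describe(sounds, "cow"));   // prints cow says MOO
        System.out.println(describe(sounds, "duck"));  // prints No sound recorded for duck
        System.out.println(sounds.size());             // prints 1
    }
}
```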

For an example of the use of a map, consider AnimalSounds.java. This program allows the user to insert animal names as keys and the sounds that they make as the associated data. The user can then ask for the sound that a given animal makes, or to remove an animal from the map.

Note the way the print operation works. The code for this is

if (animalMap.isEmpty())
  System.out.println("The map is empty");
else {
  System.out.println("Here are the animals and their sounds:");

  Set<String> animalNames = animalMap.keySet();
  Iterator<String> iter = animalNames.iterator();

  while (iter.hasNext()) {
    animal = iter.next();
    System.out.println(toTitleCase(getArticle(animal)) + " "
        + animal + " says " + animalMap.get(animal) + ".");
  }
}

Note that the first step is to call keySet to get all of the keys in the map. Then we create an iterator for the set, and we use it to iterate through the set, printing each key and the value returned by get for that key.

Bonus coverage: A more complex example: voting

We probably won't have time to get to this example in class, but it's worth going through on your own.

The method of voting in which the candidate with the most votes wins the election has some drawbacks. If two conservatives get in a race against a liberal in a conservative district, they could split the conservative vote and the liberal could get elected, even though he is the third choice of the majority of the voters in the election. Also, third parties have a hard time getting established, because voting for a third-party candidate can be throwing away your vote. If about a third of the 22,000 New Hampshire voters who voted for Nader in the Bush-Gore election had voted for Gore instead, he would have won the state and the presidency. Florida, and its hanging chads, would not have mattered.

Some states solve these problems by having a runoff election between the top two candidates if nobody gets a majority of the votes. But a runoff election costs time and money. A popular alternative suggestion is the instant runoff election.

In an instant runoff election, the voters fill out a ballot with an ordered list of candidates, from most favorable to least favorable. The election takes place in rounds. In the first round, each ballot awards a vote to the first candidate on the ballot. If nobody has a majority, then the candidate with the fewest votes is dropped from the election. (In case of ties we will choose one at random.) Then another round is run. This time, each ballot's vote is awarded to the first candidate in its list who has not been eliminated. The bottom candidate is dropped, and the process repeats until one candidate has a majority. (In fact, it can repeat until there is just one candidate left and get the same result. Once someone has a majority they will never be eliminated.)
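The rounds described above can be sketched in a few lines. This is only an illustration with made-up names, and it makes one simplifying assumption that differs from the text: a tie for last place is broken by taking the alphabetically first candidate rather than a random one, so that the result is repeatable. It also runs until a single candidate is left, which the text notes gives the same winner.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch of an instant runoff count. */
class InstantRunoffSketch {
    // Each ballot lists candidates from most to least favored.
    // Returns the winner, or null if no ballot names anyone.
    static String winner(List<List<String>> ballots) {
        // Start with the candidates who get at least one first-round vote.
        Set<String> candidates = new TreeSet<>();
        for (List<String> ballot : ballots) {
            if (!ballot.isEmpty()) {
                candidates.add(ballot.get(0));
            }
        }

        while (candidates.size() > 1) {
            // Run one round: each ballot votes for its first remaining candidate.
            Map<String, Integer> tally = new HashMap<>();
            for (String c : candidates) {
                tally.put(c, 0);
            }
            for (List<String> ballot : ballots) {
                for (String choice : ballot) {
                    if (candidates.contains(choice)) {
                        tally.put(choice, tally.get(choice) + 1);
                        break;
                    }
                }
            }

            // Drop the candidate with the fewest votes. Assumption: ties go to
            // the alphabetically first candidate (TreeSet order), not a random one.
            String loser = null;
            for (String c : candidates) {
                if (loser == null || tally.get(c) < tally.get(loser)) {
                    loser = c;
                }
            }
            candidates.remove(loser);
        }
        return candidates.isEmpty() ? null : candidates.iterator().next();
    }

    public static void main(String[] args) {
        List<List<String>> ballots = Arrays.asList(
            Arrays.asList("A", "B"), Arrays.asList("A", "C"),
            Arrays.asList("B", "C"), Arrays.asList("B", "A"),
            Arrays.asList("C", "B"));
        System.out.println(winner(ballots));  // prints B
    }
}
```

In the example, C is dropped after the first round (A 2, B 2, C 1), C's ballot transfers to B, and B then beats A by 3 votes to 2.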

How could we write a program to determine the winner of an instant runoff election? The first step is to determine what objects appear in the problem and how they interact with one another. One obvious choice is a ballot. We could say, "Oh, that is just a list" and not create an object for it. But let's take an object-oriented approach and say that there should be a Ballot class.

Another object would be the set of all the ballots in the election. We could just say, "Make a set of lists," but let's make Election a class, also.

A final object that might be less obvious is one that represents the results of the voting. Let's create a VoteTally class. The alternative is to use a map from candidate names to the number of votes that they received.

We could have a class to keep track of the current set of candidates, but the Set class seems to do everything that we are likely to need. Unless we discover an action that we need to do that the Set class doesn't handle, we will just use a Set.

What actions do we need to perform? We first need to get our set of candidates. Note that we can limit this set to candidates who get at least one first-round vote. Others will have zero votes and will be dropped before any of the candidates who got first-round votes. It sounds like Election is the class that has access to the data to perform this, with help from the Ballot class to get the first element on each ballot.

Next, we have to run a round of the election. This task requires going through all of the ballots, determining to whom each vote should go, and increasing that candidate's tally by 1. The Ballot class has the data to determine who should get the vote. The Election class has the ballots. The VoteTally class should update itself by adding a vote for the candidate.

After running a round, we have to find the candidate with the fewest votes. The VoteTally class has the information to do so. But what if there is a tie? Maybe we find a list of candidates who share the lowest vote total. We pick one at random and eliminate that candidate from the current candidate set.

We have to repeat running a round of the election and eliminating the candidate with fewest votes until we have only one candidate left. This procedure does not seem to be appropriate for any class. A method in a new class, InstantRunoff, can do this.

So what sorts of things do we want to be able to do with a Ballot object?

Initialize it with the ordered list of candidates. One way to do that is to start with an empty ballot and use an addCandidate method to add candidates to the ballot.
Get the first candidate on the ballot (in order to be able to create the set of initial candidates).
Given the current set of candidates, return the first candidate on the ballot who is in the current set of candidates.
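A minimal sketch of a class with these three operations might look like the following. It is not the code in Ballot.java: addCandidate is the name suggested above, but firstChoice and firstChoiceAmong are names made up here, as is the choice to return null when there is no suitable candidate.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of a ballot holding an ordered list of candidates. */
class BallotSketch {
    private final List<String> choices = new ArrayList<>();  // most to least favored

    // Adds a candidate as the next (less favored) choice.
    void addCandidate(String candidate) {
        choices.add(candidate);
    }

    // First candidate on the ballot, or null for an empty ballot.
    String firstChoice() {
        return choices.isEmpty() ? null : choices.get(0);
    }

    // First candidate on the ballot who is still in the given set,
    // or null if none of the ballot's candidates remains.
    String firstChoiceAmong(Set<String> remaining) {
        for (String candidate : choices) {
            if (remaining.contains(candidate)) {
                return candidate;
            }
        }
        return null;
    }
}
```

Keeping the list private means the rest of the program never depends on how a ballot stores its choices.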

There are many other possible things we could do with a ballot. Getting all of the candidates in order is one possibility, and so we could supply an iterator. A toString method could be useful. A way of getting the number of candidates on the ballot could be useful. But for now we will do the minimum. We can always come back later to add new methods.

What should we do with an Election object? We need to create it, plus perform the jobs mentioned above.

Get the ballots into the Election. A constructor to create an empty Election and an addBallot method could take care of this.
Compute the initial set of candidates.
Count the votes.

What about the VoteTally object?

Record the votes for candidates when asked to do so.
Get the list of losers (the lowest vote-getter in each round).
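One way these two jobs could be done is sketched below, backed by a map from candidate names to vote counts. This is an assumption about the design, not the code in VoteTally.java, and the names addVote and losers are made up. It considers only candidates who have at least one recorded vote.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a vote tally for one round. */
class VoteTallySketch {
    private final Map<String, Integer> votes = new HashMap<>();

    // Records one vote for the candidate.
    void addVote(String candidate) {
        Integer count = votes.get(candidate);    // null if no votes recorded yet
        votes.put(candidate, count == null ? 1 : count + 1);
    }

    // Returns every candidate tied for the fewest recorded votes.
    List<String> losers() {
        List<String> result = new ArrayList<>();
        int fewest = Integer.MAX_VALUE;
        for (String candidate : votes.keySet()) {
            int count = votes.get(candidate);
            if (count < fewest) {                // new low: start the list over
                fewest = count;
                result.clear();
            }
            if (count == fewest) {
                result.add(candidate);
            }
        }
        return result;
    }
}
```

Returning a list rather than a single name leaves the tie-breaking decision to the caller.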

The code in Ballot.java, Election.java, and VoteTally.java performs these operations. InstantRunoffOO.java supplies the method to loop through the rounds and the main method for testing.

You can run this code using ballots.txt as input, a file I made with 200 ballots generated randomly according to a probability distribution. You'll need to modify the string in ballotFileName for your own computer.

An alternate approach is InstantRunoffProc.java. This code does the same thing as InstantRunoffOO.java, but through fixed data structures and static methods. It has less code, which is a plus. There are longer lists of parameters, as all of the data must be passed around "bare." We see data declarations such as List<List<String>> ballots. These declarations are not easy to read and take getting used to. In short, there is no data encapsulation, which is a minus.

In a program this short, encapsulation and data hiding aren't that important. On the other hand, I originally had a Set of ballots instead of a List. The Set of Ballot objects in InstantRunoffOO.java wasn't a problem, because Ballot did not override equals. But in InstantRunoffProc.java, it was a problem, because two ballots with choices "Romney Huntsman" were entered into an election but Romney got only one vote. The two ArrayList objects ended up being equal, so only one was kept in the set. Changing from Set<List<String>> to List<List<String>> required five changes spread out over four methods. Finding all of the appropriate changes in a much bigger program (and avoiding changes where the Set<List<String>> wasn't dealing with ballots and may have been correct as it was) would have been tedious and error-prone. In contrast, making the change in InstantRunoffOO.java required two changes in Ballot.java: declaring the instance variable and initializing it in the constructor. I would have only needed to make those two changes, even if the program had millions of lines.
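The difference between the two cases comes down to equals. The sketch below reproduces it in miniature; PlainBallot is a made-up stand-in for a class that, like Ballot, does not override equals or hashCode.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of why a Set of bare lists loses duplicate ballots. */
class DuplicateBallotSketch {
    // Stand-in for a ballot class that inherits equals and hashCode from Object.
    static class PlainBallot {
        final List<String> choices;

        PlainBallot(List<String> choices) {
            this.choices = choices;
        }
    }

    // Two lists with the same contents are equal, so the set keeps only one.
    static int countBareLists() {
        Set<List<String>> ballots = new HashSet<>();
        ballots.add(new ArrayList<>(Arrays.asList("Romney", "Huntsman")));
        ballots.add(new ArrayList<>(Arrays.asList("Romney", "Huntsman")));
        return ballots.size();
    }

    // Two distinct objects are never equal under Object.equals, so both stay.
    static int countBallotObjects() {
        Set<PlainBallot> ballots = new HashSet<>();
        ballots.add(new PlainBallot(Arrays.asList("Romney", "Huntsman")));
        ballots.add(new PlainBallot(Arrays.asList("Romney", "Huntsman")));
        return ballots.size();
    }

    public static void main(String[] args) {
        System.out.println(countBareLists());      // prints 1
        System.out.println(countBallotObjects());  // prints 2
    }
}
```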

`Scanner` class

Note the use of the Scanner class in the programs examined today. You can open it on an input stream (usually System.in) or even on a String. Then you can read any type of data. The next method reads the next token as a String. (Recall that tokens are like words, separated by white space. But you can also change the separator. The class is very flexible.) You can also call nextLine, nextInt, nextLong, nextDouble, nextFloat, nextBoolean, nextBigInteger, nextBigDecimal, nextShort, and nextByte. It will read characters from the input and convert them to the corresponding type. There is also a "has" version of each of these that returns true if the next thing in the input can be converted to the corresponding type (hasNext, hasNextLine, hasNextInt, etc.).
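As a small illustration, the sketch below opens a Scanner on a String and uses the "has" methods to decide how to read each token. The class and method names are made up for this example.

```java
import java.util.Scanner;

/** Hypothetical sketch of reading mixed tokens with a Scanner. */
class ScannerSketch {
    // Adds up the tokens that can be read as ints and skips all the others.
    static int sumInts(String text) {
        Scanner in = new Scanner(text);
        int sum = 0;
        while (in.hasNext()) {           // is there another token of any kind?
            if (in.hasNextInt()) {
                sum += in.nextInt();     // the next token converts to an int
            } else {
                in.next();               // read and discard a non-integer token
            }
        }
        in.close();
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(sumInts("3 apples and 4 pears"));  // prints 7
    }
}
```

Checking hasNextInt before calling nextInt avoids the exception that nextInt throws when the next token is not an integer.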