CS 10: Spring 2014

Lecture 19, May 7

Lab Assignment 3

Lab Assignment 3 is due next Wednesday, May 14.

Breadth-first search on a graph

One question we might ask about a graph is how few edges we need to traverse to find a path from one vertex to another. In other words, assuming that some path exists from vertex x to vertex y, find a path from x to y that has the fewest edges.

One application is the famous Kevin Bacon game, which you will implement in Lab Assignment 3. Consider a movie actor, say Renée Adorée, a silent film actress from the 1920s. She appeared in a film with Bessie Love, who made a movie with Eli Wallach, who acted in a film with Kevin Bacon, and so we say that Renée Adorée's "Kevin Bacon number" is 3. If we were to make a graph where the vertices represent actors and we put an edge between two vertices if their actors ever made a movie together, then we can use breadth-first search to find anyone's Kevin Bacon number: the minimum number of edges on a path between a given actor's vertex and Kevin Bacon's vertex.

Mathematicians have a similar concept, called an Erdős number. Paul Erdős was a great Hungarian mathematician. His Erdős number is 0. Mathematicians who coauthored a paper with Erdős have an Erdős number of 1. Mathematicians other than Erdős who coauthored a paper with those whose Erdős number is 1 have an Erdős number of 2. In general, a mathematician who coauthored a paper with someone whose Erdős number is k and did not coauthor a paper with anyone whose Erdős number is less than k has an Erdős number of k + 1. Believe it or not, there is even such a thing as an Erdős-Bacon number, which is the sum of the Erdős and Bacon numbers. A handful of people, including Erdős himself, have finite Erdős-Bacon numbers.

I'll describe breadth-first search, which we abbreviate as BFS, for you, but I'll give you no code. That's because you'll be implementing BFS in Lab Assignment 3.

We perform BFS from a given start vertex s. We sometimes call s the root vertex. We can either designate a specific goal vertex, or we can just search to all vertices that are reachable from the start vertex (i.e., there exist paths from the start vertex to the vertices). We want to determine, for each vertex x that the search encounters, an unweighted shortest path from s to x, that is, a path from s to x with the minimum length (fewest edges). Later in the course, when we consider weighted graphs, we'll extend the notion of shortest paths to account for edge weights.

We usually record two pieces of information about each vertex x:

  1. The length of an unweighted shortest path from s to x, measured as the number of edges on the path.

  2. A back-pointer from x, which is the vertex that immediately precedes x on a shortest path from s to x. For example, let's look at that graph again:

    In this graph, an unweighted shortest path from vertex D to vertex A is D, I, C, A, and so A would have a back-pointer to C, C would have a back-pointer to I, and I would have a back-pointer to D. Because vertex D is the start vertex, no vertex precedes it. We can say that its back-pointer is null (depending on the type of the back-pointer), or we can say that D is its own back-pointer, so that a vertex that is its own back-pointer must be the start vertex.

    This unweighted shortest path is not unique in this graph. Another unweighted shortest path from vertex D to vertex A is D, B, E, A. In this path, A would have a back-pointer to E, E would have a back-pointer to B, B would have a back-pointer to D, and D's back-pointer would be null.

I like to think of BFS in terms of sending waves of runners over the graph in discrete timesteps. We'll focus on undirected graphs. I'll call the start vertex s, and at first I won't designate a goal vertex. At time 0, a runner visits s. At each later timestep, runners leave every vertex that was visited in the previous timestep and travel along each of its incident edges; any vertex that a runner reaches for the first time is visited at that timestep, gets a distance one greater than that of the vertex the runner came from, and gets that vertex as its back-pointer. The process ends once every vertex reachable from s has been visited.

If there is a specific goal vertex, you can stop this process once it has been visited.

In the above graph, let's designate vertex D as the start vertex, so that D's distance is 0 and its back-pointer is null. At time 1, runners visit vertices G, B, and I, and so each of these vertices gets a distance of 1 and D as its back-pointer. At time 2, runners from G, B, and I visit vertices D, G, B, I, F, E, C, and H. D, G, B, and I have already been visited, but F, E, C, and H all get a distance of 2. F and E get B as their back-pointer, and C and H get I as their back-pointer. At time 3, runners from F, E, C, and H visit vertices B, A, and I. B and I have already been visited, but A gets a distance of 3, and its back-pointer is set arbitrarily to either E or C (it doesn't matter which, since each of E and C precedes A on an unweighted shortest path from D). Now all vertices have been visited, and the breadth-first search terminates.

To determine the vertices on a shortest path, we use the back-pointers to get the vertices on a shortest path in reverse order. (But you know how easy it is to reverse a list, and sometimes you're fine with knowing the path in reverse order, such as when you want to display it.) Let's look at the vertices on an unweighted shortest path from D to A. We know that A is reachable from D because it has a back-pointer. So A is on the path, and it is preceded by its back-pointer. Let's say that A's back-pointer is E (which I chose arbitrarily over C so that I wouldn't get a reversed path that spells "ACID"). So E is the next vertex on the reversed path. E's back-pointer is B, and so B is the next vertex on the reversed path. B's back-pointer is D, and so D is the next vertex on the reversed path. D's back-pointer is null, and so we are done constructing the reversed path: it contains, in order, vertices A, E, B, and D.
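
For concreteness, here is a small sketch of following back-pointers in Java. The Vertex class, its backPointer instance variable, and the goal variable are assumptions for illustration; adapt the idea to however you actually store back-pointers (and note that the loop assumes the start vertex's back-pointer is null):

List<Vertex> reversedPath = new ArrayList<Vertex>();   // needs java.util imported
for (Vertex v = goal; v != null; v = v.backPointer) {
  reversedPath.add(v);   // goal vertex first, start vertex last
}
// For the example above, reversedPath would hold A, E, B, D, in that order.
// Call Collections.reverse(reversedPath) if you want the path from start to goal.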

Implementing BFS

You can't implement BFS by following the above description exactly. That's because you can't send out lots of runners at exactly the same time. You can do only one thing at a time, and so you couldn't send runners out along all edges incident on vertices G, B, and I all at the same time. You need to simulate doing that by sending out one runner at a time.

Let's define the frontier of the BFS as the set of vertices that have been visited by runners but have yet to have runners leave them. If you send out runners one at a time, at any moment, the vertices in the frontier are all at either some distance i or distance i + 1 from the start vertex. For example, suppose that we have sent out runners from vertices D, G, and B so that they are no longer in the frontier. The frontier consists of vertex I, at distance 1, and vertices F and E, at distance 2.

One key to implementing BFS is to treat the frontier properly. In particular, you want to maintain it as a FIFO queue.

Here is "pseudocode" (i.e., not real Java code) for BFS:

queue = new Queue();
queue.enqueue(s);  // initialize the queue to contain only the start vertex s
distance of s = 0;
while (!queue.isEmpty()) {
  x = queue.dequeue();  // remove vertex x from the queue
  for (each vertex y that is adjacent to x) {
    if (y has not been visited yet) {
      y's distance = x's distance + 1;
      y's back-pointer = x;
      queue.enqueue(y);   // insert y into the queue
    }
  }
}

One property of BFS is that at any time, the distances of vertices on the queue are all either d or d + 1, for some nonnegative integer d, with all the vertices at distance d preceding those at distance d + 1. We can show this property by induction on the number of enqueue and dequeue operations. At first, we have just one vertex on the queue, and its distance is 0. Now let's consider the two operations, with dequeue first. If the property holds before dequeueing, then surely removing a vertex from the queue cannot cause a violation. Now let's consider enqueueing. Suppose that the most recently dequeued vertex had distance d. Because we assume that the property holds, after dequeueing, the first vertex in the queue must have distance d or d + 1. Any vertex enqueued has distance d + 1 (1 greater than the distance of the dequeued vertex), and because enqueued vertices follow all others in the queue, the property holds after enqueueing.

Now we have a couple more questions to answer:

Representing vertex information

Again, you have a couple of choices. If you create a class, say Vertex, to represent what you need to know about a vertex, it can have whatever instance variables you like. So a Vertex object could have instance variables distance, backPointer, and even visited. Of course, you could make visited unnecessary by initializing distance to Integer.MAX_VALUE and saying that the vertex has been visited only if its distance variable is not Integer.MAX_VALUE. (Of course, this idea works only if it's impossible for an actual distance to be Integer.MAX_VALUE.) In Lab Assignment 3, you do not necessarily need to keep track of distances.
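
For concreteness, here is one possible sketch of such a class; the field names are just one choice, not a required design:

public class Vertex {
  public int distance = Integer.MAX_VALUE;  // treated as "infinity" until BFS reaches this vertex
  public Vertex backPointer = null;         // predecessor on an unweighted shortest path from the start vertex
  public boolean visited = false;           // redundant if you use the Integer.MAX_VALUE convention above
  // plus whatever else you keep for a vertex, such as its name and its adjacency information
}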

This approach is nice because it encapsulates what you'd like to know about a vertex within its object. But this approach has a downside: you need to initialize the instance variables for every vertex before you commence sending out runners from the start vertex. You wouldn't want to use, say, the visited value from a previous BFS, right? Ditto for the distance value.

There's another way: use a Map. If you want to store back-pointers, store them in a Map, where the key is a reference to a Vertex object and the value is a reference to the Vertex object for its immediate predecessor on an unweighted shortest path from the start vertex. You can use this same Map to record whether a vertex has been visited, because you can say that a vertex has been visited if and only if its Vertex object appears in the Map. You can use the containsKey method from the Map interface to determine whether the Map contains a reference to a particular Vertex object. The beauty of using a Map is that initializing is E-Z: just make a new Map. There is one other detail you need to take care of: the start vertex. It has no predecessor, but you want to record that it has been visited initially. So you'll need to store a reference to its Vertex object in the Map, and you can just use null as the corresponding value.
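
Putting these ideas together, here is a hedged sketch of BFS with a Map of back-pointers, written as a method you could place in whatever class holds your graph code. The Vertex class and its getNeighbors method are assumptions for illustration; substitute whatever adjacency information your graph actually provides:

import java.util.*;

public static Map<Vertex, Vertex> bfs(Vertex start) {
  Map<Vertex, Vertex> backPointers = new HashMap<Vertex, Vertex>();
  Queue<Vertex> queue = new LinkedList<Vertex>();
  backPointers.put(start, null);           // the start vertex is visited and has no predecessor
  queue.add(start);
  while (!queue.isEmpty()) {
    Vertex x = queue.remove();             // remove vertex x from the queue
    for (Vertex y : x.getNeighbors()) {    // assumed method giving the vertices adjacent to x
      if (!backPointers.containsKey(y)) {  // y has not been visited yet
        backPointers.put(y, x);            // y's back-pointer is x
        queue.add(y);
      }
    }
  }
  return backPointers;
}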

There is yet another way: store the breadth-first tree as a separate directed graph. Each directed edge (u, v) is equivalent to u's back-pointer being v. Looked at another way, each directed edge is oriented on a path toward the start vertex. Every vertex has at most one edge leaving it in a breadth-first tree, since every vertex has at most one back-pointer.

How long does BFS take? We visit each vertex and each edge at most once. For a graph with n vertices and m edges, the running time is O(n + m).

Depth-first search

You can think of BFS like a "layered" search, where we visit the start vertex at distance 0, then the next "layer" of vertices at distance 1, then the next layer at distance 2, and so on. We really need the queue to keep track of which vertex we search from next.

Instead, you can search the graph as if it were a maze. That's depth-first search, or DFS. You keep going deeper and deeper in the maze, making choices at the various forks in the road, until you hit a dead-end. Then you back up to one of the choice points and make a different choice. If you're fairy-tale clever, you'll scatter some bread crumbs behind you, so that you can see if you've come back to the same point and are just walking in circles. Then you'd want to do the same thing as if you hit a dead end.

Whereas BFS keeps track of vertices on a queue, DFS uses a stack. We can either use our own stack, or take advantage of the run-time stack and write DFS recursively. I prefer to use the run-time stack, because I find recursive code more natural to understand than maintaining a data structure. So here is recursive pseudocode for DFS:

visit(s);

void visit(u) {
  do whatever it means to start a visit of u;
  for (each vertex v that is adjacent to u)
    if (v has not been visited yet)
       visit(v);
  do whatever it means to finish a visit of u;
}

Now, we can use a Set to keep track of which vertices have been visited. Or we can record depth-first distances, but they're not very interesting.
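
For instance, a recursive DFS using a Set from java.util might look like this sketch; as before, Vertex and getNeighbors are assumptions about your graph interface:

private Set<Vertex> visited = new HashSet<Vertex>();

public void depthFirstSearch(Vertex u) {
  visited.add(u);                 // starting the visit of u: mark it so we never visit it again
  for (Vertex v : u.getNeighbors()) {
    if (!visited.contains(v)) {   // v has not been visited yet
      depthFirstSearch(v);
    }
  }
  // anything you want to do when the visit of u finishes goes here
}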

What's more interesting are discovery times and finishing times for vertices. Keep a static integer time, and when we start or finish a visit of a vertex, record the current value of time and then increment time:

time = 0;
visit(s);

void visit(u) {
  u's discovery time = time;
  time++;
  for (each vertex v that is adjacent to u)
    if (v has not been visited yet)
       visit(v);
  u's finishing time = time;
  time++;
}
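
One way to mirror this pseudocode in Java is to store the times in Maps and keep time as a variable of the enclosing class; again, Vertex and getNeighbors are assumptions:

private Map<Vertex, Integer> discoveryTime = new HashMap<Vertex, Integer>();
private Map<Vertex, Integer> finishingTime = new HashMap<Vertex, Integer>();
private int time = 0;

public void visit(Vertex u) {
  discoveryTime.put(u, time);             // u is discovered now
  time++;
  for (Vertex v : u.getNeighbors()) {
    if (!discoveryTime.containsKey(v)) {  // v has not been visited yet
      visit(v);
    }
  }
  finishingTime.put(u, time);             // u is finished now
  time++;
}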

I'll go through an example in class.

Like BFS, DFS visits each vertex and edge at most once, and so its running time is O(n + m).

Why are discovery and finishing times helpful? For one thing, we can use them to perform a topological sort of a directed acyclic graph.

Topological sorting

A directed graph is acyclic if it contains no directed cycles. We call a directed acyclic graph a dag. Sometimes, we want to find a linear order of the vertices of a dag such that all edges in the dag go in the direction of the linear order. That is a topological sort of the dag. For example, if we order all the vertices from left to right, then all directed edges should go from left to right.

Here's a dag that represents dependencies for putting on goalie equipment in ice hockey:

An edge (u, v) indicates that you have to put on article u before putting on article v. For example, the edge from chest pad to sweater says that you have to put on the chest pad before the sweater. On the other hand, some articles have no particular relative order. For example, it doesn't matter whether you put on compression shorts before or after the T-shirt.

You can put on only one article at a time, and so you'd like to find a linear order so that you don't have to undo having put on one article in order to put on another. In other words, you'd like to topologically sort this dag of goalie gear. Here's one possible order, given by the numbers next to the vertices:

So undershorts would go on first, then socks, compression shorts, hose, cup, pants, skates, leg pads, T-shirt, chest pad, sweater, mask, catch glove, and finally the blocker. That's not the only possible order. Here's another: T-shirt, socks, chest pad, undershorts, compression shorts, cup, hose, pants, sweater, mask, skates, leg pads, catch glove, blocker.

Here is an incredibly simple algorithm to topologically sort a dag:

  1. Run DFS on the dag to compute the finishing times for all vertices.
  2. As each vertex is finished, insert it onto the front of a linked list.
  3. The linked list contains the topologically sorted vertices, in order.

This algorithm runs in O(n + m) time. Why? Step 1, DFS, takes O(n + m) time. Step 2, inserting each vertex onto the front of a linked list, takes constant time per vertex, or O(n) time altogether. Step 3 is really just saying that the result of step 2 is the answer. So the entire algorithm takes O(n + m) time.
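
Here is a hedged sketch of this algorithm in Java, using a LinkedList and a Set. It assumes the same hypothetical Vertex class with a getNeighbors method, where a vertex's neighbors are the vertices its outgoing edges point to, and it assumes the graph really is a dag:

private Set<Vertex> visited = new HashSet<Vertex>();
private LinkedList<Vertex> order = new LinkedList<Vertex>();

public List<Vertex> topologicalSort(Iterable<Vertex> allVertices) {
  for (Vertex u : allVertices) {
    if (!visited.contains(u)) {
      visitAndRecord(u);         // run DFS from every vertex not yet visited
    }
  }
  return order;                  // read front to back, this is a topological sort
}

private void visitAndRecord(Vertex u) {
  visited.add(u);
  for (Vertex v : u.getNeighbors()) {    // the vertices that u's outgoing edges point to
    if (!visited.contains(v)) {
      visitAndRecord(v);
    }
  }
  order.addFirst(u);                     // u has just finished, so it goes at the front of the list
}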

I'll go through an example for the goalie-gear dag in class.