In this assignment, you will build an undirected graph whose vertices are movie actors and edges indicate that the actors appeared in the same film. You will then build a breadth-first search (BFS) tree as a directed graph and answer queries about paths in the BFS tree from actors to Kevin Bacon—their "Kevin Bacon numbers." You will get practice with reading files and doing some simple parsing of them, building graphs, running BFS, and traversing paths in a directed graph.
As in the previous two lab assignments, you are permitted to work with one other student currently in the class. You do not have to work with someone else, but you have the option of doing so. Your partner for this assignment does not have to be the same partner you worked with previously. If you choose to work with a partner, you will both get the same grade on the assignment. If you would like to work with a partner and cannot find one, let me know and I'll try to match you up with someone else who is seeking a partner.
There will be no penalty, in terms of points, for working together on this assignment. Please make sure that both of you submit the same exact code via Canvas. If you have a partner, state your partner's name in the comments section of Canvas.
You should weigh whether you will get more out of this assignment working alone or with someone else. The choice is up to you.
If you choose to work with someone else, pick your partner carefully. Make sure that there are times that you are both available to work together. If you frequent the lab and you notice someone else who often is there when you are, that person might be a good choice as your partner.
In the Kevin Bacon game, you are given an actor and try to find the shortest sequence of actors between the given actor and Kevin Bacon, where you list actors consecutively if they appeared in a movie together. For example, the silent film star Renée Adorée was in The Blackbird (1926) with Doris Lloyd, who was in Keep 'Em Flying (1941) with Carol Bruce, who was in Planes, Trains, and Automobiles (1987) with Kevin Bacon. Since that is the shortest sequence of actors between Renée Adorée and Kevin Bacon, we say that Renée Adorée's Kevin Bacon number is 3.
The input to the Kevin Bacon game comes from three files, which are (unfortunately) outdated and not at all complete. They were downloaded from http://www.cs.luther.edu/~bmiller/CS151/Spring05/index.html. (The Oracle of Bacon site maintains a much more up-to-date database internally. They claim to scour the Internet Movie Database every couple of weeks. The example above for Renée Adorée comes from the Oracle of Bacon, However, their files contain 2.9 million actors and 1.9 million movies and TV shows, which is a bit excessive for a lab assignment!)
actors.txt: This file contains one line per actor. Each line consists of an integer ID, a pipe symbol (|
), and an actor's name, with no spaces around the pipe. The file contains 9235 lines.
For testing, we have a much smaller version of this file in actorsTest.txt, consisting of only seven lines. Here is what actorsTest.txt looks like:
1|Kevin Bacon 3|Tom Hanks 841|Meg Ryan 3823|Parker Posey 3590|Lisa Kudrow 2976|W.C. Fields 1534|George Burns
movies.txt: This file contains one line per movie. Each line consists of an integer ID, a pipe symbol, and the name of a movie. The file contains 7067 lines. The smaller version for testing is moviesTest.txt:
150|Apollo 13 (1995) 539|Sleepless in Seattle (1993) 6885|In the Cut (2003) 2424|You've Got Mail (1998) 1160|Six of a Kind (1934) 1875|Clockwatchers (1997)
movie-actors.txt: This file contains one line for each actor's appearance in a movie. Each line consists of a movie ID, a pipe symbol, and an actor ID, indicating that the actor appeared in the movie. The smaller version for testing is movie-actorsTest.txt:
150|1 150|3 6885|1 6885|841 539|3 539|841 2424|3 2424|3823 2424|841 1875|3823 1875|3590 1160|2976 1160|1534
For example, the first line indicates that actor 1 (Kevin Bacon) appeared in movie 150 (Apollo 13 (1995)).
Here is the undirected graph that the test files give:
Vertices are labeled by actor names, and edges are labeled by movie names (I've omitted the years). In this small graph, Tom Hanks and Meg Ryan have Kevin Bacon numbers of 1, Parker Posey's Kevin Bacon number is 2, and Lisa Kudrow's Kevin Bacon number is 3. W.C. Fields and George Burns have infinite Kevin Bacon numbers, since there is no path between them and Kevin Bacon. A query for Lisa Kudrow could produce the following output:
Enter the name of an actor: Lisa Kudrow
Lisa Kudrow's Kevin Bacon number is 3.
Lisa Kudrow appeared in Clockwatchers (1997) with Parker Posey.
Parker Posey appeared in You've Got Mail (1998) with Tom Hanks.
Tom Hanks appeared in Apollo 13 (1995) with Kevin Bacon.
In fact, there are two paths with three edges between Lisa Kudrow and Kevin Bacon. Instead of going through Tom Hanks, the path could have gone through Meg Ryan. That's OK, because we don't need all the paths with fewest edges to Kevin Bacon; we just need one of them.
A piece of advice about processing the lines of the input files. For each line of each file, you will need to find the string to the left of the pipe symbol and the string to the right of the pipe symbol. The String
method indexOf
returns the index of a character within a string. Once you know where the pipe symbol is within a line, you can use the String
method substring
to extract the strings on either side of the pipe symbol. As usual, you can read about these methods in the Java documentation.
From the three input files, you will need to form an undirected graph, which we call the Bacon graph (though there's nothing special about Kevin Bacon's place in the graph). It should be like the one above for the test files.
If you look carefully at the test files, you'll see something interesting. In the figure, the edge between Tom Hanks (ID 1) and Meg Ryan (ID 841) is labeled with the movie Sleepless in Seattle (ID 539). But could also have been labeled it with the movie You've Got Mail (ID 2424). Put another way, we could have had two edges between Tom Hanks and Meg Ryan, since they made both of those rom-coms together. When you form the graph, you will call a method insertEdge
, which throws an IllegalArgumentException
if the graph already contains an edge between the two vertices given in the call. You must catch this exception and ignore it. The extra edge won't be added to the graph, but you don't want the exception to crash your program. So catch it and ignore it.
Don't go defining your own classes for graphs. Use NamedAdjacencyMapGraph
, which you wrote in sa10. (You may used the sample solution to sa10 instead of your code if you prefer.) This class takes generic types V
and E
. V
is the type of the information stored in each vertex; you'll want it to be String
because it's an actor's name. E
is the type of information stored in each edge (in addition to the vertices on which it is incident); you'll want it to be String
again, this time because it's the title of a movie. The constructor for the class takes a single parameter, a boolean, indicating whether the graph should be directed (true
) or undirected (false
).
The main
program given in sa10 for testing your NamedAdjacencyMapGraph
methods gives examples of how to create a graph, add vertices and edges, iterate through incident edges, and catch exceptions. You might want to look at it.
You may find it useful to create maps for mapping IDs to actor names and IDs to movie names. You can also use a map to figure out which actors appeared in each movie, and can use that information to add the appropriate edges to the graph. I suggest that you make, for each movie, some sort of ordered set of actors in the movie. For example, you could make an ArrayList
of actors for each movie. As you process the movie-actors.txt file, add the actor in each line to the movie in that line. When you're done reading the file, you have, for each movie, an ArrayList
of all the actors in that movie. Once you read the entire file, you're ready to add edges to the Bacon graph. For each movie, go through the ArrayList
of actors in that movie. For each actor x in the ArrayList
, go through all the other actors that appear after him or her in the ArrayList
and add an edge between x and the other actor. If a movie has k actors, then you'll add k − 1 edges between the first actor and all those who follow, k − 2 edges between the second actor and all those who follow, and so on, until the next-to-last actor, who will have an edge to only the last actor in the ArrayList
. The total number of edges added will be (k − 1) + (k − 2) + (k − 3) + ⋯ + 2 + 1 + 0, which equals k(k − 1)/2. (You don't have to use an ArrayList
. It's just a suggestion.)
Once the Bacon graph is formed, you'll need to run BFS from a root vertex (the Kevin Bacon vertex). I have covered the concept of BFS in lecture, but it's up to you to implement it. I suggest that you store the BFS tree in a separate graph, and that you make it be a directed graph. Edges go toward the root, so that you can find a path in the BFS tree by following directed edges from a vertex to the root. As with the Bacon graph, you should use a NamedAdjacencyMapGraph
.
You will need a queue for your BFS. Any sort of linked list will do. You can use the SentinelDLL
class that we've seen, or you can use a linked-list class from the Java library.
Do you need to record the distance from the root for each vertex? That is, do you need to record Kevin Bacon numbers? I'll leave that decision up to you. You will need to traverse the path from the vertex for a given actor to the root when producing output, and you can determine the length of the path at that time. Or you can store distances as you construct the BFS tree. You might find it convenient to store distances in a Map
whose keys are either actor names or Vertex
objects.
You don't need to compute the BFS tree upon each actor query, since you're using the same BFS tree each time. What you do need to do anew for each query is construct a path from the given actor to the root (Kevin Bacon). As mentioned before, you can just follow directed edges until you get to the root.
In your directed BFS tree, each vertex has an most one edge leaving it. How do you find that one edge? You can create an iterator for that vertex's outgoing edges (the outgoingEdges
method of AdjacencyMapGraph
returns such an iterator), and then you can call next
once on that iterator. It will return the one edge that leaves the vertex. The opposite
method will tell you the next vertex on the page. You can then fire up another iterator on that next vertex to find its leaving edge, and so on, until you arrive at the root. So you would keep creating iterators and running each one's next
method one time before going on to the next iterator. One way to know when to stop is that the root of the BFS tree has no leaving edges, and so if you call hasNext
after creating an iterator but before calling next
, and if hasNext
returns false
, you're at the root.
You can print out edges as you find them, or you can append them to a list and then print out the edges in the list. It's up to you. (If you make a list, then you can find the length of the path by just asking for the length of the list.)
If there is no path from the actor to the root, your program should say so. And if the actor is not even in the actors.txt file, your program should say that, too. For example:
Enter the name of an actor: Art Clokey
Art Clokey's Kevin Bacon number is infinite.
Enter the name of an actor: Moe Howard
Moe Howard is not in our database.
A couple of extra-credit ideas are implemented in the sample solution.
Why be limited to the Kevin Bacon game? Why not be able to play the Tom Hanks game, for example? The sample solution provides an option where if the first character typed into the console is a question mark, then the rest of the line is taken as the name of an actor who becomes the root. For example:
To quit the program, type return in answer to a question.
To change the root to another actor, start the line with a question mark.
To see actors with a specific number, type the number.
Enter the name of an actor: Parker Posey
Parker Posey's Kevin Bacon number is 2.
Parker Posey appeared in You've Got Mail (1998) with Meg Ryan.
Meg Ryan appeared in In the Cut (2003) with Kevin Bacon.
Enter the name of an actor: ?Tom Hanks
Now playing the Tom Hanks game.
Enter the name of an actor: John Belushi
John Belushi's Tom Hanks number is 3.
John Belushi appeared in Neighbors (1981) with Dan Aykroyd.
Dan Aykroyd appeared in Great Outdoors, The (1988) with John Candy.
John Candy appeared in Splash (1984) with Tom Hanks.
Of course, when you switch to a new root, you do have to compute a whole new BFS tree.
The sample solution also provides an option where, if the first character typed into the console is a digit, then the entire line is converted to an integer, and all actors whose distance from the root is given by that integer are listed. Continuing the previous example:
Enter the name of an actor: 7
Actors with a Tom Hanks number of 7:
Gordon Harker
Randle Ayrton
Jobyna Ralston
Jameson Thomas
Richard Arlen
Clara Bow
Malcolm Keen
Gibb McLaughlin
Anny Ondra
Enter the name of an actor: 11
Actors with a Tom Hanks number of 11:
No actors have a Tom Hanks number of 11.
This option was implemented by making a Map
that maps actor names to their distance from the root, and filling in entries the BFS tree is built.
It's so easy to misspell an actor's name or to just use the wrong form of the name. Perhaps you type in Timothy Allen instead of Tim Allen or Kate Blanchett instead of Cate Blanchett. Devise some sort of scheme for finding "near matches" and reporting near matches to the name entered, so that the user can re-enter the correct form of the name.
Or come up with some interesting statistics. What is the average Kevin Bacon number among actors with finite Kevin Bacon numbers? The median? What pair of actors has the longest path between them? (The length of a longest path is called the diameter of the graph.) Which actor has the shortest longest finite path? (In other words, from which vertex do you minimize the maximum finite path length?)
For the basic assignment, submit everything in Canvas. Do not include net.datastructures; we already have it. You also do not need to include NamedAdjacencyMapGraph.java, unless you have made changes to it (which you will of course clearly indicate, if you have made changes). Submit the basic part of the assignment on Canvas in the Lab Assignment 5 column.
If you do extra credit, submit the files needed to run your extra credit submission; if some files are shared submit them twice. Submit the basic part of the assignment on Canvas in the Lab Assignment 5 column, and extra credit in the Lab Assignment 5 Extra Credit column. We ask for your extra credit to be submitted separately so that if it does not work correctly, we can still grade the basic assignment. Again, do not submit net.datastructures or NamedAdjacencyMapGraph.java (unless you have made changes to it).
Along with your code, include output from test runs. As well as running the program, also print out the graph generated by BFS on the small data set, so that we can see that your BFS it is working correctly.
Your section leader will run your program and will grade this assignment looking for the following things:
Total of 130 points
20 | BFS works correctly |
---|---|
10 | Reads data files |
20 | Creates the graph |
15 | Interactive interface with the user (read names, print results) |
15 | Traces the path from an actor back to Kevin Bacon |
5 | Handles boundary cases correctly. |
10 | Good decomposition into objects and methods. |
---|---|
5 | Proper use of instance and local variables |
2 | Instance variables are private |
3 | Proper use of parameters |
5 | Comments for classes |
---|---|
5 | Comments for methods (purpose, parameters, what is returned) in JavaDoc form. |
3 | Good names for methods, variables, parameters |
1 | Layout (blank lines, indentation, no line wraps, etc.) |
1 | Other unspecified style issues |
4 | Output of BFS graph for small data set. |
---|---|
6 | Demonstrate that your code works, including boundary cases (infinite Bacon number, not found in database, etc.) |