CS 134 Project: Major League Baseball Performance Prediction

Problem To Solve

In 2009, the average major league baseball player earned a salary just shy of three million dollars[5]. Because of the amount of money involved, team owners would like some assurance about a player's future production, so that they know they are getting their money's worth. Several methods have been developed to this end, including Bill James's similarity score system[2], which starts from a default similarity score and adjusts it based on statistical differences and position modifiers, and Nate Silver's PECOTA system[7], which uses nearest neighbor analysis to match a player with others who have similar statistics and predicts the player's future production based on these comparisons.

We plan to wade into these waters ourselves using machine learning techniques to predict a player's future performance using a method similar to Silver's approach (more on the differences between the two in the next section). We plan on taking a player's past performance, using this data to determine which of the players in our training set are most similar to our test subject, and using the training data to predict the performance of a test subject.

Our Project Plan, and How It Differs From PECOTA

We plan to use a k-nearest neighbor regression algorithm for our prediction program. In its classification form, the k-nearest neighbor algorithm assigns an unknown object the most common class among its k nearest neighbors[3]; the regression form instead predicts a numeric value from the values of those k neighbors, typically their average. In previous work, k-nearest neighbor regressions have been used to predict a wide variety of object attributes, ranging from the basal area distribution of trees[4] to the hand velocity of monkeys (using neural activity)[6].
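
A minimal sketch of k-nearest neighbor regression in Python, assuming Euclidean distance and an unweighted mean of the neighbors' values. Both are illustrative choices rather than settled design decisions, and the toy numbers are made up:

```python
import math

def knn_regress(train, query, k):
    """Predict a numeric target as the mean target of the k nearest training points.

    train: list of (feature_vector, target) pairs; query: a feature vector.
    """
    # Rank training points by Euclidean distance to the query point.
    ranked = sorted(train, key=lambda pair: math.dist(pair[0], query))
    nearest = ranked[:k]
    return sum(target for _, target in nearest) / len(nearest)

# Made-up data: features are (batting average, home runs),
# target is the following season's home run total.
toy_train = [
    ((0.300, 30), 28),
    ((0.295, 32), 34),
    ((0.250, 10), 8),
    ((0.240, 12), 11),
]
prediction = knn_regress(toy_train, (0.298, 31), k=2)
```

Note that home runs dominate the distance here because they are on a much larger scale than batting average; features would likely need to be normalized before distances are meaningful.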

For this project, we will restrict ourselves to predicting a small subset of batting statistics (i.e., we will ignore pitchers). We will begin with predicting batting average, home runs, and runs batted in, but may expand this list as time permits.

Our overall strategy will be as follows:

Collect season-by-season batting data from our designated data sources.
Connect player data over individual seasons to see how players performed over time, and store the season records within a database for ease of processing.
Implement a k-nearest neighbor algorithm that takes a group of data points representing a player's year-by-year performance and determines the k players whose data points are closest to our test player data.
For each test player, use our algorithm to find the k players nearest to the test player, and use their data to generate a prediction for the player's performance in the next year. (Players closer to the test data may be weighted more than players who are farther away.)
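
The neighbor search in the last two steps could be sketched as follows. This is only one possible reading of the plan: it assumes each career is a list of per-season (batting average, home runs, RBI) tuples, that seasons are aligned by career year, and that a training player is usable only if he has at least one more season than the test player (that extra season becomes the prediction target). The function names and toy data are hypothetical.

```python
import math

def career_distance(test_career, train_career):
    """Mean per-season Euclidean distance over the test player's seasons.

    Assumption: seasons are aligned by career year (first with first, etc.).
    """
    n = len(test_career)
    return sum(math.dist(a, b) for a, b in zip(test_career, train_career[:n])) / n

def k_nearest_players(test_career, training_set, k):
    """Return the k closest training players as (distance, name, next_season)."""
    n = len(test_career)
    candidates = []
    for name, career in training_set.items():
        # Require one season beyond the test player's career length,
        # so that season can serve as the value to predict from.
        if len(career) > n:
            dist = career_distance(test_career, career)
            candidates.append((dist, name, career[n]))
    return sorted(candidates)[:k]

# Made-up careers: each season is (batting average, home runs, RBI).
training = {
    "player_a": [(0.280, 20, 80), (0.290, 25, 90), (0.300, 30, 100)],
    "player_b": [(0.250, 5, 40), (0.260, 8, 45), (0.255, 7, 50)],
    "player_c": [(0.285, 22, 78), (0.288, 24, 88)],  # too short to supply a target
}
test_player = [(0.282, 21, 82), (0.291, 26, 91)]
neighbors = k_nearest_players(test_player, training, k=1)
```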

Our system falls under the following categories:

Non-parametric. Because we will be using a subset of past data to make our performance predictions, we will need to keep our training data around for the testing stage. This also means we will be computing our prediction values on the fly rather than using a regression model made from all of our training data.
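Instance-based learning. The relevance of a training data point depends on how similar it is to a specific test data point, so we cannot decide beforehand which training points will be used for a given prediction. (Each training point still carries a known outcome, namely the player's actual statistics in the following season, so the method is a form of supervised learning.)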

As stated earlier, the PECOTA system also uses a nearest-neighbor algorithm for its prediction scheme[7]. Our project will differ from PECOTA in the following ways:

PECOTA predicts a player's performance by only looking at a three-year window of that player's career[9]. We plan on considering a player's entire career when looking for similar players to use for comparison.
PECOTA generates seven predicted statistic lines, each with its own confidence level[8]. In our program, we will output only one prediction, based on a weighted average of the performance of similar players.
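
One way to form the single weighted-average prediction is inverse-distance weighting, sketched below. The weighting scheme is an assumption made for illustration (the plan only says that closer players may count more), and the function name, the eps guard, and the numbers are hypothetical.

```python
def weighted_prediction(neighbors, eps=1e-9):
    """Combine neighbors' stat lines into one line using inverse-distance weights.

    neighbors: list of (distance, stat_line) pairs, where every stat_line is a
    tuple of numbers in the same order (e.g. batting average, home runs, RBI).
    eps avoids division by zero when a neighbor is at distance 0.
    """
    weights = [1.0 / (dist + eps) for dist, _ in neighbors]
    total = sum(weights)
    n_stats = len(neighbors[0][1])
    return tuple(
        sum(w * line[i] for w, (_, line) in zip(weights, neighbors)) / total
        for i in range(n_stats)
    )

# Made-up neighbors: the closer player (distance 1.0) gets three times
# the weight of the farther one (distance 3.0).
similar = [
    (1.0, (0.300, 30, 100)),
    (3.0, (0.260, 10, 60)),
]
predicted_line = weighted_prediction(similar)
```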

Data Sets To Use

We have several options for collecting data, with different amounts of data and associated costs. The Baseball Guru[10] provides data from 1997 to 2009 free of charge, though we will need to manually link player data across the seasons. In contrast, Baseball-Reference.com[1] provides data stretching back to 1871, but would only provide this data for a fee (and forbids crawling their website for large amounts of data with an automated tool). Between these two data sets (and perhaps others), we should be able to gather enough player information to make useful predictions.
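
Linking player data across seasons might look like the sketch below. It assumes each season row is a dictionary with name, birth_year, and season fields; these field names are hypothetical, and the real files may need a different key to tell apart players who share a name.

```python
from collections import defaultdict

def link_seasons(rows):
    """Group per-season rows into careers keyed by (name, birth_year).

    Each career is returned as a list of rows sorted by season.
    """
    careers = defaultdict(list)
    for row in rows:
        careers[(row["name"], row["birth_year"])].append(row)
    return {
        key: sorted(seasons, key=lambda r: r["season"])
        for key, seasons in careers.items()
    }

# Made-up rows, deliberately out of order.
season_rows = [
    {"name": "Player A", "birth_year": 1975, "season": 1999, "hr": 20},
    {"name": "Player B", "birth_year": 1980, "season": 1998, "hr": 5},
    {"name": "Player A", "birth_year": 1975, "season": 1998, "hr": 15},
]
careers = link_seasons(season_rows)
```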

Our plan is to use any data we collect up to the 2008 season as our training data set, and use this for initially tuning the system. We will set aside the 2009 season as our official test set, to ensure that we have a set to measure our system's performance on unknown data.
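
The split itself is simple; a sketch, assuming each season row is a dictionary with a season field (a hypothetical layout):

```python
def split_by_season(rows, test_season=2009):
    """Separate rows into training data (before test_season) and held-out test data."""
    train = [r for r in rows if r["season"] < test_season]
    test = [r for r in rows if r["season"] == test_season]
    return train, test

# Made-up rows for illustration.
all_rows = [
    {"name": "Player A", "season": 2007, "hr": 18},
    {"name": "Player A", "season": 2008, "hr": 22},
    {"name": "Player A", "season": 2009, "hr": 25},
]
train_rows, test_rows = split_by_season(all_rows)
```

Under this layout, a player's seasons through 2008 would serve as the input to the prediction, and his 2009 row would be used only to score it.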

Project Timeline

Weeks 1-2 (April 14-24): Collect and process data
Weeks 3-4 (April 25-May 8): Implement k-nearest neighbor algorithm, put together milestone report
Weeks 5-6 (May 9-22): Test on training data set, run program against test data set, expand scope of predictions as time permits
Weeks 7-8 (May 23-June 1): Put together final report, tie up any loose ends

References

1. Baseball-Reference.com. Sports Reference LLC, 2000. Web. 12 April 2010. http://www.baseball-reference.com/.
2. James, Bill. The Politics of Glory. New York: Macmillan Publishing Company, 1994. pp. 88-106. Print.
3. "K-nearest neighbor algorithm." Wikipedia.com. Wikipedia, n.d. Web. 12 April 2010. http://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm.
4. Maltamo, Matti, and Annika Kangas. "Methods based on k-nearest neighbor regression in the prediction of basal area diameter distribution." Canadian Journal of Forest Research 28 (1998): 1107-1115.
5. "Major League Baseball Players Association: Frequently Asked Questions." MLBPlayers.com. MLB Players Association, n.d. Web. 12 April 2010. http://mlbplayers.mlb.com/pa/info/faq.jsp.
6. Navot, Amir, et al. "Nearest Neighbor Based Feature Selection for Regression and its Application to Neural Activity." Advances in Neural Information Processing Systems. 5-10 Dec. 2005, Vancouver, BC, Canada. Ed. Y. Weiss, B. Schölkopf, and J. Platt. Cambridge, MA: MIT Press, 2006.
7. "PECOTA." Wikipedia.com. Wikipedia, n.d. Web. 12 April 2010. http://en.wikipedia.org/wiki/PECOTA.
8. Schwartz, Alan. "KEEPING SCORE; Predicting Futures in Baseball, and the Downside of Damon." New York Times 13 Nov. 2005: n.p. Web. 12 April 2010. http://query.nytimes.com/gst/fullpage.html?res=9C0CEEDA133EF930A25752C1A9639C8B63.
9. Silver, Nate. "Was Gaylord Perry Too Generic to be a Hall of Famer?" BP: Unfiltered. Baseball Prospectus, 8 Jan. 2007. Web. 12 April 2010. http://www.baseballprospectus.com/unfiltered/?p=136. While the initial question pertained to pitchers, Silver declared that "the PECOTAs should generally be thought of as a three year cross-section of a player's career."
10. "The Baseball Guru's Private Data Archive." BaseballGuru.com. The Baseball Guru, n.d. Web. 12 April 2010. http://baseballguru.com/bbdata1.html.

Jason Reeves CS 134 Project
