Predictive Applications of Ensemble Methods: NBA Betting

Gediminas Bertasius, Jason Wardy, Michelle Shu

 

Problem and Objectives

Few people realize the immense potential for profit in sports betting. While sports betting is highly volatile and risky, a model that performs even slightly better than random guessing can produce very desirable results. In fact, to break even, the model only has to predict 52.63% of the outcomes correctly: this figure corresponds to odds of 1.90, i.e. risking one unit to win 0.9, so the break-even rate p must satisfy 0.9p = 1 - p, giving p = 1/1.9 ≈ 52.63%. In this project we address two major types of bets: Against the Spread (ATS) and Over/Under (OU) bets. An ATS bet asks whether the difference between the two teams' scores will be higher or lower than a number set by the bookmakers. An OU bet asks whether the two teams' combined score will be higher or lower than a number set by the bookmakers.

Here is an example:

Team 1 Team 2 ATS OU
Miami Boston +7 190

Suppose I bet that Miami will win by more than 7 points and that the two teams will combine for fewer than 190 points. If the final score is 110-102, Miami wins by 8, so I win the ATS bet; but the combined total is 212, so I lose the OU bet.

We aim to develop a predictive machine learning model that can be applied to the two problems corresponding to these two types of bets. The model will take input vectors consisting of various features of the teams' past performance and use regression techniques to predict two target values for the upcoming game: (1) the sum of the two teams' scores and (2) the difference between the two teams' scores.

We will assess the performance of our model by computing the proportion of past bets it would have won relative to the cutoffs established by the bookmaker. If our model wins more than 50% of these bets, we can conclude that its predictions are at least as accurate (if not more so) as the bookmaker's. Additionally, if our model wins more than 52.63% of the bets, it would realize a positive profit. Since we assume that the bookmaker uses an "intelligent" method to set the boundary values for bets, we can also evaluate our models by the accuracy of the raw numerical scores they produce (i.e., the sum and the difference of the point values).
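
To make this evaluation concrete, here is a minimal sketch in Java (the language we plan to implement in) of settling Over/Under-style bets against bookmaker lines and reporting the win rate relative to the 52.63% break-even point. The class, method, and data in main are purely illustrative and not part of either dataset.

/** Minimal sketch of the betting evaluation described above; all names are illustrative. */
public class BetEvaluation {

    /**
     * For each game, bet the side (over/under, or cover/not cover for ATS) that the model's
     * prediction favors, then settle the bet against the actual outcome.
     * Pushes (actual result exactly on the line) are skipped for simplicity.
     */
    static double winRate(double[] predicted, double[] actual, double[] line) {
        int wins = 0, settled = 0;
        for (int i = 0; i < line.length; i++) {
            if (actual[i] == line[i]) continue;        // push: stake returned
            boolean betOver = predicted[i] > line[i];  // model favors the "over" side
            boolean overWon = actual[i] > line[i];
            settled++;
            if (betOver == overWon) wins++;
        }
        return settled == 0 ? 0.0 : (double) wins / settled;
    }

    public static void main(String[] args) {
        // Toy Over/Under numbers in the spirit of the Miami-Boston example: line 190,
        // model predicts 185 (bet the under), actual total 212 -> the bet loses.
        double[] predictedTotals = {185, 201, 188};
        double[] actualTotals    = {212, 205, 180};
        double[] bookmakerLines  = {190, 195, 190};
        double rate = winRate(predictedTotals, actualTotals, bookmakerLines);
        System.out.printf("Win rate: %.1f%% (break-even at 52.63%%)%n", 100 * rate);
    }
}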

 

Data

We are using two datasets for this project. The first is collected from the ESPN database and contains the statistics (shooting percentage, rebounds, points, etc.) from ~2600 individual NBA games. Data from the 2009-2010 season will be used for training (754 samples), while data from the 2010-2011 season will be used for testing (811 samples). The second dataset is gathered from the Odds Shark betting database and contains the scores of the games as well as the ATS and OU odds given by the bookmakers for the individual games. [6] [5]

 

Methods

Base Classifiers

Pruned Variable Random Tree

The VR-Tree algorithm works in the same way as a regular decision tree, except that a randomization factor is introduced while the tree is built. As one of the parameters, the user must specify the probability with which the tree generates deterministic nodes; randomly split nodes are generated with the complementary probability. [3] [1]
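
A minimal sketch of this split-selection step is shown below. Choosing the deterministic split by minimizing the children's sum of squared errors is our assumption for the regression setting, and all class and method names are illustrative.

import java.util.Random;

/** Sketch of VR-Tree-style split selection: deterministic with probability p, random otherwise. */
public class VrSplitSketch {
    static class Split { int feature; double threshold; }

    static Split chooseSplit(double[][] x, double[] y, double pDeterministic, Random rng) {
        if (rng.nextDouble() < pDeterministic) {
            return bestSplit(x, y);               // classic exhaustive (deterministic) search
        }
        Split s = new Split();                    // otherwise pick a random feature and threshold
        s.feature = rng.nextInt(x[0].length);
        s.threshold = x[rng.nextInt(x.length)][s.feature];
        return s;
    }

    /** Exhaustive search for the split minimizing the children's sum of squared errors. */
    static Split bestSplit(double[][] x, double[] y) {
        Split best = new Split();
        double bestScore = Double.POSITIVE_INFINITY;
        for (int f = 0; f < x[0].length; f++)
            for (double[] row : x) {
                double t = row[f];
                double score = sse(x, y, f, t, true) + sse(x, y, f, t, false);
                if (score < bestScore) { bestScore = score; best.feature = f; best.threshold = t; }
            }
        return best;
    }

    static double sse(double[][] x, double[] y, int f, double t, boolean left) {
        double sum = 0, n = 0;
        for (int i = 0; i < y.length; i++)
            if ((x[i][f] <= t) == left) { sum += y[i]; n++; }
        if (n == 0) return 0;
        double mean = sum / n, err = 0;
        for (int i = 0; i < y.length; i++)
            if ((x[i][f] <= t) == left) err += (y[i] - mean) * (y[i] - mean);
        return err;
    }

    public static void main(String[] args) {
        double[][] x = {{1, 10}, {2, 20}, {3, 30}, {4, 40}};
        double[] y = {5, 6, 14, 15};
        Split s = chooseSplit(x, y, 0.8, new Random(42));
        System.out.println("Split on feature " + s.feature + " at threshold " + s.threshold);
    }
}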

Pruning is done with the reduced-error algorithm. While building each tree, 20% of the data is held out for validation. At each iteration, the node whose removal produces the biggest reduction of the error on the held-out sample is selected for pruning. Iterations continue until further pruning becomes harmful. [1] [9]
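
One way this pruning loop could look for a simple regression tree is sketched below; the Node structure, the squared-error criterion, and the choice to also prune on ties are our assumptions.

import java.util.*;

/** Sketch of reduced-error pruning for a regression tree (structure and names are illustrative). */
public class ReducedErrorPruning {
    static class Node {
        int feature; double threshold, prediction;   // prediction = mean training target at this node
        Node left, right;
        boolean isLeaf() { return left == null && right == null; }
        double predict(double[] x) {
            if (isLeaf()) return prediction;
            return x[feature] <= threshold ? left.predict(x) : right.predict(x);
        }
    }

    static double holdoutError(Node root, double[][] xVal, double[] yVal) {
        double err = 0;
        for (int i = 0; i < yVal.length; i++) {
            double d = root.predict(xVal[i]) - yVal[i];
            err += d * d;
        }
        return err;
    }

    /** Repeatedly collapse the internal node whose removal most reduces the holdout error. */
    static void prune(Node root, double[][] xVal, double[] yVal) {
        while (true) {
            double bestError = holdoutError(root, xVal, yVal);
            Node bestNode = null;
            for (Node n : internalNodes(root)) {
                Node savedLeft = n.left, savedRight = n.right;
                n.left = null; n.right = null;               // temporarily collapse to a leaf
                double err = holdoutError(root, xVal, yVal);
                n.left = savedLeft; n.right = savedRight;    // restore
                if (err <= bestError) { bestError = err; bestNode = n; }  // prefer smaller trees on ties
            }
            if (bestNode == null) return;                    // every further prune would be harmful
            bestNode.left = null; bestNode.right = null;     // commit the best prune
        }
    }

    static List<Node> internalNodes(Node n) {
        List<Node> out = new ArrayList<>();
        if (n == null || n.isLeaf()) return out;
        out.add(n);
        out.addAll(internalNodes(n.left));
        out.addAll(internalNodes(n.right));
        return out;
    }

    public static void main(String[] args) {
        Node leafA = new Node(); leafA.prediction = 1.0;
        Node leafB = new Node(); leafB.prediction = 9.0;
        Node root = new Node(); root.feature = 0; root.threshold = 0.5;
        root.left = leafA; root.right = leafB; root.prediction = 5.0;
        double[][] xVal = {{0.2}, {0.8}};
        double[] yVal = {5.0, 5.1};                          // holdout says a single leaf is better
        prune(root, xVal, yVal);
        System.out.println("Root is a leaf after pruning: " + root.isLeaf());
    }
}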

BUTIA (Bottom Up Tree Induction Algorithm)

BUTIA addresses two of the biggest shortcomings of regular trees: the possibility of getting stuck in a local optimum and the use of only one feature when selecting a split. Unlike most tree induction algorithms, BUTIA builds the tree from the bottom up. First, the k-means algorithm is used to partition the original data set into k clusters. Then the pair of clusters closest to each other is found and a Support Vector Machine is trained on that pair: the hyperplane that produces the maximum margin between the two clusters is used as the splitting rule, and the two clusters are combined into a new, larger cluster. The process is repeated until a single cluster remains. [4]
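
A rough sketch of this bottom-up control flow is given below. To keep it self-contained we stub out the k-means step with fixed groups and replace the SVM step with the perpendicular bisector of the two cluster centroids, so it illustrates the merge loop only, not the actual maximum-margin splits.

import java.util.*;

/** Heavily simplified sketch of a BUTIA-style bottom-up merge loop (names are illustrative). */
public class BottomUpSketch {
    static class Cluster { List<double[]> points = new ArrayList<>(); }
    /** Splitting rule of the form w.x + b <= 0 vs. > 0, recorded when two clusters are merged. */
    static class Rule { double[] w; double b; }

    static double[] centroid(Cluster c) {
        double[] m = new double[c.points.get(0).length];
        for (double[] p : c.points)
            for (int j = 0; j < m.length; j++) m[j] += p[j] / c.points.size();
        return m;
    }

    static double dist2(double[] a, double[] b) {
        double d = 0;
        for (int j = 0; j < a.length; j++) d += (a[j] - b[j]) * (a[j] - b[j]);
        return d;
    }

    public static void main(String[] args) {
        // Stand-in for the k-means step: three small clusters in 2-D.
        List<Cluster> clusters = new ArrayList<>();
        double[][][] groups = {{{0, 0}, {0, 1}}, {{5, 5}, {5, 6}}, {{9, 0}, {9, 1}}};
        for (double[][] g : groups) {
            Cluster c = new Cluster();
            c.points.addAll(Arrays.asList(g));
            clusters.add(c);
        }
        List<Rule> rules = new ArrayList<>();
        while (clusters.size() > 1) {
            // Find the closest pair of clusters (here measured by centroid distance).
            int bi = 0, bj = 1;
            double best = Double.POSITIVE_INFINITY;
            for (int i = 0; i < clusters.size(); i++)
                for (int j = i + 1; j < clusters.size(); j++) {
                    double d = dist2(centroid(clusters.get(i)), centroid(clusters.get(j)));
                    if (d < best) { best = d; bi = i; bj = j; }
                }
            // Separating hyperplane for the pair; a real implementation would call an SVM here.
            double[] ci = centroid(clusters.get(bi)), cj = centroid(clusters.get(bj));
            Rule r = new Rule();
            r.w = new double[ci.length];
            r.b = 0;
            for (int k = 0; k < ci.length; k++) {
                r.w[k] = cj[k] - ci[k];
                r.b -= r.w[k] * (ci[k] + cj[k]) / 2;   // plane through the midpoint of the centroids
            }
            rules.add(r);
            // Merge the pair into one cluster and continue bottom-up.
            clusters.get(bi).points.addAll(clusters.get(bj).points);
            clusters.remove(bj);
        }
        System.out.println("Recorded " + rules.size() + " splitting rules.");
    }
}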

 

Ensemble Methods

Bagging

Bagging (bootstrap aggregating) relies on the fact that combining many independent base learners significantly decreases the error. Therefore we want to produce as many independent base learners as possible. Each base learner (a VR-Tree in our case) is trained on a sample of the original data set drawn with replacement. Each such sample leaves out approximately 36.8% of the original samples (the probability that a particular sample is never drawn is (1 - 1/n)^n ≈ e^-1 ≈ 0.368); these left-out samples are used to estimate the error for pruning. N (a user-specified parameter) such trees will be generated, each of them will vote on every prediction, and the average of their votes will be used as the final prediction. [2] [3]
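
The bootstrap-and-average skeleton is sketched below; a trivial learner that predicts the mean of its bootstrap sample stands in for the VR-Tree base learner, and all names are illustrative.

import java.util.*;

/** Sketch of bagging: bootstrap samples, out-of-bag counts, and averaged predictions. */
public class BaggingSketch {
    interface Learner { double predict(double[] x); }

    /** Stand-in base learner: always predicts the mean target of its bootstrap sample. */
    static Learner train(double[] y, int[] bag) {
        double mean = 0;
        for (int i : bag) mean += y[i] / bag.length;
        final double m = mean;
        return x -> m;
    }

    public static void main(String[] args) {
        double[] y = {10, 12, 11, 20, 22, 21, 30, 29};
        int n = y.length, N = 25;                        // N = user-specified ensemble size
        Random rng = new Random(7);
        List<Learner> ensemble = new ArrayList<>();
        int oobTotal = 0;
        for (int t = 0; t < N; t++) {
            int[] bag = new int[n];
            boolean[] inBag = new boolean[n];
            for (int i = 0; i < n; i++) {                // sample n points with replacement
                bag[i] = rng.nextInt(n);
                inBag[bag[i]] = true;
            }
            for (boolean b : inBag) if (!b) oobTotal++;  // out-of-bag points, usable for pruning
            ensemble.add(train(y, bag));
        }
        double sum = 0;
        for (Learner l : ensemble) sum += l.predict(new double[]{4.5});
        System.out.println("Averaged prediction: " + sum / ensemble.size());
        System.out.println("Average out-of-bag fraction: " + (double) oobTotal / (N * n));
    }
}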

Coalescence

Coalescence is another ensemble method that relies heavily on randomly generated trees. N individual VR-Trees are generated, each with a different probability of producing deterministic nodes. As with bagging, the average of all N classifiers' votes is used as the final prediction. [2] [3]

Random Forest

Random Forest is an extension of bagging whose major difference is the incorporation of randomized feature selection. At each split, the Random Forest algorithm randomly selects a subset of the original set of features and produces a traditional deterministic split over that subset. N such trees are produced, and their average vote is used as the final prediction. [2] [3]
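
The mechanical difference from plain bagging is the per-split feature subset. A minimal sketch of that sampling step follows; the sqrt(d) subset size is a common default rather than something dictated by our data.

import java.util.*;

/** Sketch of the feature subsampling used in Random Forest; names are illustrative. */
public class FeatureSubsetSketch {
    static int[] sampleFeatureSubset(int numFeatures, int subsetSize, Random rng) {
        List<Integer> all = new ArrayList<>();
        for (int f = 0; f < numFeatures; f++) all.add(f);
        Collections.shuffle(all, rng);
        int[] subset = new int[subsetSize];
        for (int i = 0; i < subsetSize; i++) subset[i] = all.get(i);
        return subset;
    }

    public static void main(String[] args) {
        int d = 20;                                      // e.g. 20 per-game statistics
        int m = (int) Math.round(Math.sqrt(d));          // common default subset size
        int[] candidates = sampleFeatureSubset(d, m, new Random(3));
        // A deterministic best split would now be searched over these features only.
        System.out.println("Candidate features for this split: " + Arrays.toString(candidates));
    }
}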

AdaBoost

AdaBoost (adaptive boosting) is a method for building a strong classifier from a pool of weak classifiers. It maintains a weight on each training example and iteratively selects the weak classifier that performs best on the currently weighted examples, increases the weights of the examples that classifier misclassifies, and assigns the classifier a coefficient so that the weighted sum of all selected classifiers' outputs provides the final prediction. [7] While the weak classifiers in standard AdaBoost solve binary classification problems, an extension of AdaBoost called AdaBoost.R solves regression problems by reducing them to infinitely many classification problems. [8]
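
The classic binary form of this update is sketched below using decision stumps as weak learners on toy one-dimensional data; the regression extension would replace the weak learner and the loss, and all names and data here are illustrative.

import java.util.*;

/** Compact sketch of binary AdaBoost with decision stumps, illustrating the
 *  example-reweighting and weighted-vote mechanism (names are illustrative). */
public class AdaBoostSketch {
    static class Stump { double threshold; int polarity; double alpha; }

    /** Weak learner: the stump (threshold, polarity) with the lowest weighted error. */
    static Stump trainStump(double[] x, int[] y, double[] w) {
        Stump best = new Stump();
        double bestErr = Double.POSITIVE_INFINITY;
        for (double t : x)
            for (int pol : new int[]{-1, 1}) {
                double err = 0;
                for (int i = 0; i < x.length; i++) {
                    int pred = x[i] <= t ? pol : -pol;
                    if (pred != y[i]) err += w[i];
                }
                if (err < bestErr) { bestErr = err; best.threshold = t; best.polarity = pol; }
            }
        best.alpha = 0.5 * Math.log((1 - bestErr) / Math.max(bestErr, 1e-12));
        return best;
    }

    static int predict(List<Stump> ensemble, double x) {
        double score = 0;
        for (Stump s : ensemble)
            score += s.alpha * (x <= s.threshold ? s.polarity : -s.polarity);
        return score >= 0 ? 1 : -1;
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4, 5, 6, 7, 8};
        int[] y = {1, 1, 1, -1, -1, -1, 1, 1};       // labels in {-1, +1}
        int n = x.length, rounds = 10;
        double[] w = new double[n];
        Arrays.fill(w, 1.0 / n);                     // start with uniform example weights
        List<Stump> ensemble = new ArrayList<>();
        for (int t = 0; t < rounds; t++) {
            Stump s = trainStump(x, y, w);
            ensemble.add(s);
            double z = 0;
            for (int i = 0; i < n; i++) {            // up-weight examples the stump got wrong
                int pred = x[i] <= s.threshold ? s.polarity : -s.polarity;
                w[i] *= Math.exp(-s.alpha * pred * y[i]);
                z += w[i];
            }
            for (int i = 0; i < n; i++) w[i] /= z;   // renormalize to a distribution
        }
        int correct = 0;
        for (int i = 0; i < n; i++) if (predict(ensemble, x[i]) == y[i]) correct++;
        System.out.println("Training accuracy: " + correct + "/" + n);
    }
}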

 

Timeline

We plan to implement all of the algorithms above in Java by the milestone date and to reserve the time after the milestone for revisions and evaluation. Gedas will focus on the ensemble methods for regression trees, and Jason and Michelle will work on AdaBoost.

Already done: Regression tree to be used as base classifier
1/26: Implement Bagging, Coalescence, Random Forest
2/1: Implement AdaBoost
2/8: Implement BUTIA, explore other extensions and modifications
2/19: Milestone Deadline

 

References

[1] Breiman, Leo. Classification and Regression Trees. Belmont, CA: Wadsworth International Group, 1984.
[2] Zhou, Zhi-Hua. Ensemble Methods: Foundations and Algorithms. Boca Raton, FL: Taylor & Francis, 2012.
[3] Liu, Fei Tony, Kai Ming Ting, Yang Yu, and Zhi-Hua Zhou. "Spectrum of Variable-Random Trees." Journal of Artificial Intelligence Research (2008).
[4] Barros, Rodrigo, Ricardo Cerri, Pablo Jaskowiak, and Andre Carvalho. "A Bottom-Up Oblique Decision Tree Induction Algorithm." IEEE (2011).
[5] "Sports Betting Information on Odds Shark." Sports Betting Odds. Web. 21 Jan. 2013.
[6] "ESPN NBA Scoreboard." NBA Basketball Scores. Web. 21 Jan. 2013.
[7] Rojas, Raúl. "AdaBoost and the Super Bowl of Classifiers: A Tutorial Introduction to Adaptive Boosting." (2009).
[8] Shrestha, D.L. and Solomatine, D.P. “Experiments with AdaBoost.RT, an Improved Boosting Scheme for Regression.” Neural Computation (2006): 18:7, 1678-1710.
[9] "Reduced Error Pruning." Reduced Error Pruning. N.p., n.d. Web. 21 Jan. 2013.