Extracting new features
We will now extract some features from this dataset by combining and comparing the existing data. First, we need to specify our class value, which will give our classification algorithm something to compare against to see if its prediction is correct or not. This could be encoded in a number of ways; however, for this application, we will specify our class as 1 if the home team wins and 0 if the visiting team wins. In basketball, the team with the most points wins. So, while the dataset doesn't specify who wins directly, we can easily compute it.
We can specify the class value with the following:
dataset["HomeWin"] = dataset["VisitorPts"] < dataset["HomePts"]
We then copy those values into a NumPy array to use later for our scikit-learn classifiers. There is not currently a clean integration between pandas and scikit-learn, but they work nicely together through the use of NumPy arrays. While we will use pandas to extract features, we will need to extract the values to use them with scikit-learn:
y_true = dataset["HomeWin"].values
The preceding array now holds our class values in a format that scikit-learn can read.
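To see these two steps in isolation, here is a minimal sketch on a made-up table. The team names and scores are hypothetical, and only the column names follow the ones used above:

```python
import pandas as pd

# Hypothetical scores for three games (not taken from the real dataset)
toy = pd.DataFrame({
    "Visitor Team": ["A", "C", "B"],
    "VisitorPts": [98, 110, 101],
    "Home Team": ["B", "A", "C"],
    "HomePts": [105, 102, 99],
})

# The home team wins when it scores more points than the visitor
toy["HomeWin"] = toy["VisitorPts"] < toy["HomePts"]

# Extract the class values as a NumPy array for use with scikit-learn
y_toy = toy["HomeWin"].values
print(y_toy)
```

Only the first game is a home win here, so the array holds one True followed by two False values.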
By the way, a better baseline for sports prediction than random guessing is to predict the home team in every game. Home teams have been shown to have an advantage in nearly all sports across the world. How big is this advantage? Let's have a look:
dataset["HomeWin"].mean()
The resulting value, around 0.59, indicates that the home team wins 59 percent of games on average. This is higher than the 50 percent we would expect from random chance, and it is a simple rule that applies to most sports.
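The reason the mean doubles as a baseline is that always predicting a home win is correct exactly when the home team wins. A small sketch with hypothetical outcomes (the array below is invented for illustration) makes this concrete:

```python
import numpy as np

# Hypothetical outcomes: True means the home team won
y_example = np.array([True, True, False, True, False])

# Baseline: predict a home win for every game
baseline_pred = np.ones_like(y_example, dtype=bool)

# Accuracy of the baseline is the fraction of correct predictions
baseline_accuracy = np.mean(baseline_pred == y_example)
print(baseline_accuracy, y_example.mean())
```

Both printed values are the same, because the baseline's accuracy is simply the fraction of games won by the home team.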
We can also start creating some features to use in our data mining for the input values (the X array). While sometimes we can just throw the raw data into our classifier, we often need to derive continuous numerical or categorical features from our data.
For our current dataset, we can't really use the features already present (in their current form) to make a prediction. We would not know the scores of a game at the time we need to predict its outcome, so we cannot use them as features. While this might sound obvious, it can be easy to miss.
The first two features we want to create to help us predict which team will win record whether each of the two teams won its previous game. This roughly approximates which team is currently playing well.
We will compute this feature by iterating through the rows in order and recording which team won. When we get to a new row, we look up whether the team won the last time we saw them.
We first create a (default) dictionary to store the team's last result:
from collections import defaultdict
won_last = defaultdict(int)
We then create two new columns on our dataset to store the values of these features:
dataset["HomeLastWin"] = 0
dataset["VisitorLastWin"] = 0
The key of this dictionary will be the team and the value will be whether they won their previous game. We can then iterate over all the rows and update the current row with the team's last result:
for index, row in dataset.iterrows():
    home_team = row["Home Team"]
    visitor_team = row["Visitor Team"]
    # Write each team's previous result into the dataset itself;
    # assigning to row would only change a temporary copy
    dataset.at[index, "HomeLastWin"] = won_last[home_team]
    dataset.at[index, "VisitorLastWin"] = won_last[visitor_team]
    # Record the result of this game for each team's next game
    won_last[home_team] = int(row["HomeWin"])
    won_last[visitor_team] = 1 - int(row["HomeWin"])
Note that the preceding code relies on our dataset being in chronological order. Our dataset is in order; however, if you are using a dataset that is not in order, you will need to replace dataset.iterrows() with dataset.sort_values("Date").iterrows().
Those last two lines in the loop update our dictionary with either a 1 or a 0, depending on which team won the current game. This information is used for the next game each team plays.
After the preceding code runs, we will have two new features: HomeLastWin and VisitorLastWin. Have a look at the dataset using dataset.head(6) to see an example of a home team and a visitor team that won their most recent game. Have a look at other parts of the dataset using the pandas indexer:
dataset.loc[1000:1005]
Currently, this gives a value of 0 (false) to every team (including the previous year's champion!) the first time it is seen. We could improve this feature using the previous year's data, but we will not do that in this chapter.
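To check how the last-win features behave, including the default of 0 for teams seen for the first time, the same loop can be run on a small made-up table. This is only a sketch: the teams and results are hypothetical, and the frame is called games so that it does not clash with the real dataset:

```python
from collections import defaultdict

import pandas as pd

# Hypothetical results for four games, already in chronological order
games = pd.DataFrame({
    "Visitor Team": ["A", "C", "B", "A"],
    "Home Team": ["B", "A", "C", "B"],
    "HomeWin": [True, False, False, True],
})
games["HomeLastWin"] = 0
games["VisitorLastWin"] = 0

# Maps each team to 1 if it won its previous game, 0 otherwise (or if unseen)
won_last = defaultdict(int)

for index, row in games.iterrows():
    home_team = row["Home Team"]
    visitor_team = row["Visitor Team"]
    # Store the result of each team's previous game on the current row
    games.at[index, "HomeLastWin"] = won_last[home_team]
    games.at[index, "VisitorLastWin"] = won_last[visitor_team]
    # Update the dictionary with the outcome of the current game
    won_last[home_team] = int(row["HomeWin"])
    won_last[visitor_team] = 1 - int(row["HomeWin"])

print(games)
```

In the first two games every team is new, so both features are 0. By the third game, teams C and B have each won their previous game, so both features are 1 on that row.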