Using the Feature Selection node creatively to remove or decapitate perfect predictors
In this recipe, we will identify perfect or near-perfect predictors to ensure that they do not contaminate our model. Perfect predictors earn their name by being correct 100 percent of the time, which usually indicates circular logic rather than a prediction of value. This is a common and serious problem.
When this occurs, we have accidentally allowed information into the model that could not possibly have been known at the time of the prediction. Everyone 30 days late on their mortgage receives a late letter, but receiving a late letter is not a useful predictor of lateness because the lateness caused the letter, not the other way around.
The rather colorful term decapitate is borrowed from the data miner Dorian Pyle. It is a reference to the fact that perfect predictors will be found at the top of any list of key drivers ("caput" means head in Latin). Therefore, to decapitate is to remove the variable at the top. Their status at the top of the list will be capitalized upon in this recipe.
The following table shows the three time periods: the past, the present, and the future. It is important to remember that, when we are making predictions, we can use information from the past to predict the present or the future, but we cannot use information from the future to predict the future. This seems obvious, but it is common to see analysts use information that was gathered after the date for which predictions are made. For example, if a company sends out a notice after a customer has churned, you cannot say that the notice is predictive of churning.
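The time-period rule can be sketched outside Modeler as a simple filter on when each field's value was captured. This is a minimal illustration in plain Python; the field names and capture dates are hypothetical, not from the KDD Cup 1998 data.

```python
from datetime import date

# Hypothetical feature metadata: field name -> date the value was captured.
# Any field captured after the decision date is future information relative
# to the prediction and must be excluded from the model.
feature_capture_dates = {
    "donation_history": date(2024, 1, 15),
    "age": date(2024, 1, 15),
    "late_letter_sent": date(2024, 3, 10),  # generated after the outcome
}

def usable_features(capture_dates, decision_date):
    """Keep only features whose values were known before the decision."""
    return sorted(name for name, captured in capture_dates.items()
                  if captured <= decision_date)

print(usable_features(feature_capture_dates, date(2024, 2, 1)))
# ['age', 'donation_history'] -- the late letter is dropped as future information
```

In practice this check is a matter of knowing your data's collection dates rather than running code, but the logic is the same: compare each field's capture date against the moment of decision.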
Getting ready
We will start with a blank stream, and we will be using the cup98lrn reduced vars2.txt data set.
How to do it...
To identify perfect or near-perfect predictors and ensure that they do not contaminate our model, follow these steps:
1. Build a stream with a Source node, a Type node, and a Table node, then force instantiation by running the Table node.
2. Force TARGET_B to be a flag and make it the target.
3. Add a Feature Selection Modeling node and run it.
4. Edit the resulting generated model and examine the results. In particular, focus on the top of the list.
5. Review what you know about the top variables, and check whether any could be related to the target by definition, or could be based on information that actually postdates the information in the target.
6. Add a CHAID Modeling node, set it to run in Interactive mode, and run it.
7. Examine the first branch, looking for any child node that might be perfectly predicted; that is, look for child nodes whose members all fall into one category.
8. Repeat steps 6 and 7 for the first several variables.
9. Set any variables found to be problematic (in steps 5 and/or 7) to None in the Type node.
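The child-node check in step 7 boils down to measuring category purity: if every member of a predictor's category falls into one target category, that branch is perfectly predicted. A minimal sketch of that idea in plain Python, using a made-up late-letter example rather than the actual campaign fields:

```python
from collections import Counter, defaultdict

# Toy records: each row pairs a candidate predictor value with the flag target.
# "letter_sent" is a leaky field: the letter is generated only after the outcome.
rows = [
    ("letter_sent", 1), ("letter_sent", 1), ("letter_sent", 1),
    ("no_letter", 0), ("no_letter", 0), ("no_letter", 1),
]

def category_purity(pairs):
    """For each predictor category, return the share held by the majority target."""
    by_cat = defaultdict(Counter)
    for value, target in pairs:
        by_cat[value][target] += 1
    return {cat: max(c.values()) / sum(c.values()) for cat, c in by_cat.items()}

purity = category_purity(rows)
suspect = [cat for cat, p in purity.items() if p >= 0.95]
print(suspect)  # ['letter_sent'] -- a perfectly pure branch, worth investigating
```

A pure branch is not automatically leakage; it is a prompt to ask the step 5 question about whether the field could have been known at decision time.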
How it works...
Which variables need decapitation? The problem is information that, although known at the time you extracted the data, was not known at the time of decision. In this case, the time of decision is the moment the potential donor decided whether or not to donate. Was the donation amount, TARGET_D, known before the decision was made to donate? Clearly not. No information that dates from after the information in the target variable can ever be used in a predictive model.
This recipe is built on the following foundation: variables with this problem will float up to the top of the Feature Selection results.
They may not always be perfect predictors, but perfect predictors must always go, and near-perfect predictors with the same circular origin must go too. For example, suppose that when a customer initially rejects or postpones a purchase, a follow-up sales call is scheduled for 90 days later. Those customers are recorded as having rejected the offer in the campaign, and as a result most of them had a follow-up call in the 90 days after the campaign. Since a few of the follow-up calls might not have happened, the call variable won't be a perfect predictor, but it still must go.
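The follow-up call scenario above can be made concrete with a toy calculation: even when a few calls were missed, guessing the target from each category's majority is still almost perfect, which is exactly the signature that should trigger removal. The counts here are invented for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical campaign outcomes: 97 of 100 rejected-offer customers got the
# scheduled follow-up call; 3 calls were missed; 100 acceptors got no call.
records = (
    [("follow_up_call", "rejected")] * 97
    + [("no_call", "rejected")] * 3
    + [("no_call", "accepted")] * 100
)

def best_guess_accuracy(pairs):
    """Accuracy of predicting the target from each category's majority class."""
    by_cat = defaultdict(Counter)
    for value, target in pairs:
        by_cat[value][target] += 1
    correct = sum(max(c.values()) for c in by_cat.values())
    return correct / len(pairs)

acc = best_guess_accuracy(records)
print(round(acc, 3))  # 0.985 -- under 100 percent, but still a leaky field
```

The variable predicts at 98.5 percent not because it is a good driver, but because the outcome caused it; the imperfection comes only from the missed calls.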
Note that variables such as RFA_2 and RFA_2A are both very recent and highly predictive. Are they a problem? You can't be absolutely certain without knowing the data. Here, the information recorded in these variables is calculated just prior to the campaign. If the calculation had been made just after it, they would have to go, and in that case the CHAID tree would almost certainly have shown evidence of perfect prediction.
There's more...
Sometimes a model has to have a lot of lead time; predicting today's weather is a different challenge from predicting next year's in the farmer's almanac. When more lead time is desired, you could consider dropping all of the _2 series variables. What would the advantage be? Suppose you were buying advertising space and there was a 45-day delay before the advertisement appeared. If the _2 variables occur between your advertising deadline and your campaign, you might have to use information obtained in the _3 campaign.