Go Machine Learning Projects

Handling bad numbers

Another part of the janitorial work is handling bad numbers. A good example is the LotFrontage variable. From the data description, we know that this is supposed to be a continuous variable, so every value should be directly convertible to float64. Looking at the data, however, we see that this is not the case: some entries are NA.

LotFrontage, according to the description, is the linear feet of street connected to the property. NA could mean one of two things:

  • We have no information on whether there is a street connected to the property
  • There is no street connected to the property

In either case, replacing NA with 0 is reasonable: once NA is treated as 0, the next lowest value in LotFrontage is 21, so 0 does not collide with any real measurement. There are other ways of imputing the data, of course, and often those imputations will lead to better models. But for now, we'll impute it with 0.
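The imputation described above can be sketched as a small parsing helper. This is a minimal sketch, not the book's own code: the function name parseContinuous is hypothetical, and it assumes each cell arrives as a string read from the CSV file.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseContinuous converts one raw CSV cell of a continuous variable
// into a float64. The name is hypothetical. A cell holding "NA" is
// imputed as 0; anything else must parse as a number.
func parseContinuous(raw string) (float64, error) {
	s := strings.TrimSpace(raw)
	if s == "NA" {
		return 0, nil // impute missing values with 0
	}
	return strconv.ParseFloat(s, 64)
}

func main() {
	// Made-up sample cells standing in for a LotFrontage column.
	for _, cell := range []string{"65", "NA", "21"} {
		v, err := parseContinuous(cell)
		if err != nil {
			fmt.Println("bad number:", err)
			continue
		}
		fmt.Println(cell, "->", v)
	}
}
```

Returning the error from strconv.ParseFloat, rather than silently mapping every unparseable cell to 0, keeps genuinely malformed data visible while still treating NA as a known, deliberate case.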

We can do the same with the other continuous variables in this dataset, because they still make sense when NA is replaced with 0. One tip is to use the variable in a sentence: this house has an Unknown GarageArea. If that is the case, what would be the best guess? Well, it's reasonable to assume that the house has no garage, so it's OK to replace NA with 0.

Note that this may not be the case in other machine learning projects. Remember: human insight may be fallible, but it's often the best solution for a lot of irregularities in the data. If you happen to be a realtor and have a lot more domain knowledge, you can infuse that knowledge into the imputation phase. For example, you can use some variables to calculate and estimate other variables.

As for the categorical variables, we can for the most part treat NA as the zero value of the variable, so nothing changes if there is an NA. There is some categorical data, however, for which NA or None wouldn't make sense. This is where the aforementioned clever encoding of categories comes in handy. For the following variables, we'll use the most commonly found value as the zero value:

  • MSZoning
  • BsmtFullBath
  • BsmtHalfBath
  • Utilities
  • Functional
  • Electrical
  • KitchenQual
  • SaleType
  • Exterior1st
  • Exterior2nd
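One way to make the most common value the zero value is sketched below. It is an illustration under assumptions rather than the book's implementation: the function name modeIndex is hypothetical, the column is assumed to be a slice of strings, and the sample data is made up.

```go
package main

import (
	"fmt"
	"sort"
)

// modeIndex builds a category -> integer index in which the most
// frequent non-NA category gets index 0. The name is hypothetical.
// "NA" is deliberately left out of the map, so looking it up returns
// Go's zero value for int, 0, which is the most common category.
func modeIndex(column []string) map[string]int {
	counts := make(map[string]int)
	for _, v := range column {
		if v != "NA" {
			counts[v]++
		}
	}

	cats := make([]string, 0, len(counts))
	for c := range counts {
		cats = append(cats, c)
	}
	// Most frequent first; ties are broken alphabetically so that the
	// encoding is deterministic across runs.
	sort.Slice(cats, func(i, j int) bool {
		if counts[cats[i]] != counts[cats[j]] {
			return counts[cats[i]] > counts[cats[j]]
		}
		return cats[i] < cats[j]
	})

	index := make(map[string]int, len(cats))
	for i, c := range cats {
		index[c] = i
	}
	return index
}

func main() {
	// Made-up sample standing in for the MSZoning column.
	msZoning := []string{"RL", "RM", "RL", "NA", "FV", "RL"}
	index := modeIndex(msZoning)
	for _, v := range msZoning {
		fmt.Println(v, index[v])
	}
}
```

A side effect of relying on the map's zero value is that any category missing from the map, not only NA, is also encoded as 0. If that is not wanted, the two-value lookup form (v, ok := index[key]) can be used to detect unseen categories.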

Furthermore, some variables are categorical even though their data is numerical. An example found in the dataset is the MSSubClass variable: its values are numeric codes, not quantities. When encoding this kind of categorical data, it makes sense to sort the categories numerically, such that the 0 value is indeed the lowest value.
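A numerically sorted encoding might look like the following sketch. The function name numericIndex and the code type are hypothetical, and the sample codes are made up. Note that a plain string sort would not do here, because "120" sorts before "20" lexicographically.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
)

// code pairs a raw category string with its parsed numeric value.
type code struct {
	raw string
	n   int
}

// numericIndex assigns indices to numerically coded categories in
// ascending numeric order, so the lowest code gets index 0.
// The name is hypothetical.
func numericIndex(column []string) (map[string]int, error) {
	seen := make(map[string]bool)
	var codes []code
	for _, v := range column {
		if seen[v] {
			continue // already recorded this category
		}
		n, err := strconv.Atoi(v)
		if err != nil {
			return nil, fmt.Errorf("non-numeric category %q: %w", v, err)
		}
		seen[v] = true
		codes = append(codes, code{raw: v, n: n})
	}

	// Sort by numeric value, not by string.
	sort.Slice(codes, func(i, j int) bool { return codes[i].n < codes[j].n })

	index := make(map[string]int, len(codes))
	for i, c := range codes {
		index[c.raw] = i
	}
	return index, nil
}

func main() {
	// Made-up sample of numeric category codes.
	index, err := numericIndex([]string{"60", "20", "120", "20"})
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(index["20"], index["60"], index["120"])
}
```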