Advanced Machine Learning with R

Reviewing model assumptions

A linear regression model is only as good as the validity of its assumptions, which can be summarized as follows (formal checks for several of them are sketched after the list):

  • Linearity: There must be a linear relationship between the predictor and the response variables. If this relationship is not present, transformations (log, polynomial, exponent, and so on) of X or Y may solve the problem.
  • Non-correlation of errors: The error terms should not be correlated with one another, a common problem in time series and panel data, where the error for one observation is correlated with the errors of neighboring observations; if the errors are correlated, you run the risk of creating a poorly specified model.
  • Homoscedasticity: The errors are normally distributed with a constant variance, which means that the variance of the errors is the same across different input values. Violations of this assumption don't create biased coefficient estimates, but they do produce improper standard errors for the coefficients, which can make the statistical tests for significance too high or too low, leading to wrong conclusions. A violation of this assumption is called heteroscedasticity.
  • No collinearity: No linear relationship should exist between any two predictor variables, which is to say that there should be no correlation between the features. This issue can lead to incorrect statistical tests for the coefficients.
  • Presence of outliers: Outliers can severely skew the estimates, so they must be examined and handled, via removal or transformation, when fitting a model using linear regression; as we saw in the Anscombe example, outliers can lead to a biased estimate.
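
Formal tests exist for several of these assumptions. The following is a minimal sketch, assuming the lmtest and car packages are installed and a fitted lm object such as the yield_fit created later in this section; multi_fit is a hypothetical model with two or more predictors, since vif() doesn't apply to a simple regression:

> library(lmtest)

> dwtest(yield_fit)  # Durbin-Watson test; H0: the errors are not autocorrelated

> library(car)

> vif(multi_fit)  # variance inflation factors; values above 5 are a common red flag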

A simple way to initially check the assumptions is by producing plots. The plot() function, when applied to a linear model fit, will automatically generate four plots that allow you to examine the assumptions. R produces the plots one at a time, and you advance through them by hitting the Enter key. It's best to examine all four simultaneously, and we can do so in the following manner:

> par(mfrow = c(2,2))

> plot(yield_fit)

The output of the preceding code is a two-by-two grid of diagnostic plots: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage.

The two plots on the left allow us to examine the homoscedasticity of the errors and non-linearity. What we're looking for is some pattern or, more importantly, the absence of any pattern. Given the sample size of only 17 observations, no pattern is visible. Common heteroscedastic errors appear u-shaped, inverted u-shaped, or clustered close together on the left of the plot, becoming wider as the fitted values increase (a funnel shape). It's safe to conclude that no violation of homoscedasticity is apparent in our model.
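
If you want a formal check to back up the visual read, the Breusch-Pagan test from the lmtest package tests the null hypothesis of constant error variance; a minimal sketch, again assuming the package is installed:

> library(lmtest)

> bptest(yield_fit)  # H0: constant error variance; a large p-value supports homoscedasticity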

The Normal Q-Q plot in the upper-right corner helps us to determine whether the residuals are normally distributed. A Quantile-Quantile (Q-Q) plot shows the quantile values of one variable plotted against the quantile values of another. It appears that the outliers (observations 7, 9, and 10) may be causing a violation of the assumption. The Residuals vs Leverage plot tells us which observations, if any, are unduly influencing the model; in other words, whether there are any outliers we should be concerned about. The statistic is Cook's distance, or Cook's D, and it's generally accepted that a value greater than one is worthy of further inspection.
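
Both of these checks can also be run numerically with base R; a minimal sketch, again assuming the yield_fit object from above:

> shapiro.test(residuals(yield_fit))  # Shapiro-Wilk test; H0: the residuals are normally distributed

> which(cooks.distance(yield_fit) > 1)  # observations with a Cook's D greater than one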

What exactly is further inspection? This is where art meets science. The easy way out would be to delete the observation, in this case number 9, and redo the model. However, a better option may be to transform the predictor and/or the response variables. If we just delete observation 9, then observations 10 and 13 might fall outside the band for greater than one. In this simple example, I believe that this is where domain expertise can be critical. More times than I can count, I've found that exploring and understanding outliers can yield valuable insights. When we first examined the previous scatter plot, I pointed out the potential outliers, and these happen to be observations number 9 and number 13. It's important to discuss them with the appropriate subject matter experts in order to understand why this is the case. Is it a measurement error? Is there a logical explanation for these observations? I certainly don't know, but this is an opportunity to increase the value that you bring to an organization.
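
Should you decide to act after that discussion, both options are one-liners with update(); a minimal sketch, once more assuming the yield_fit object from earlier (the log transformation is purely illustrative):

> fit_no9 <- update(yield_fit, subset = -9)  # refit with observation 9 removed

> coef(yield_fit); coef(fit_no9)  # compare the coefficient estimates

> fit_log <- update(yield_fit, log(.) ~ .)  # or log-transform the response instead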

Let's leave this simple case behind us and move on to a supervised learning case involving multivariate linear regression.