
Assessing a model
Evaluating a model is an essential part of the workflow. There is no point in creating the most sophisticated model if you do not have the tools to assess its quality. The validation process consists of defining quantitative reliability criteria, setting a strategy such as a K-fold cross-validation scheme, and selecting the appropriate labeled data.
Validation
The purpose of this section is to create a Scala class to be used in future chapters for validating models. For starters, the validation process relies on a set of metrics to quantify the fitness of a model generated through training.
Key metrics
Let's consider a simple classification model with two classes, positive and negative, represented by black and white dots respectively in the following diagram. Data scientists use the following terminology:
- True positives (TP): These are observations that are correctly labeled as belonging to the positive class
- True negatives (TN): These are observations that are correctly labeled as belonging to the negative class
- False positives (FP): These are observations incorrectly labeled as belonging to the positive class; they actually belong to the negative class
- False negatives (FN): These are observations incorrectly labeled as belonging to the negative class; they actually belong to the positive class
Categorization of validation results
This simple representation extends to classification problems that involve more than two classes. For instance, a false positive for a given class is an observation incorrectly assigned to that class when it actually belongs to another. These four counts are used for evaluating accuracy, precision, recall, and the F and G measures, defined in the following list and summarized as formulas after it:
- Accuracy: Represented as ac, this is the percentage of observations correctly classified.
- Precision: Represented as p, this is the percentage of observations correctly classified as positive in the group that the classifier has declared positive.
- Recall: Represented as r, this is the percentage of actual positive observations that are correctly classified as positive.
- F-Measure or F-score F1: This is a measure of a test's accuracy that strikes a balance between precision and recall. It is computed as the harmonic mean of the precision and recall, with values ranging between 0 (worst score) and 1 (best score).
- G-measure: Represented as G, this is similar to the F-measure but is computed as the geometric mean of precision p and recall r.
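These definitions translate into the following standard formulas, written in terms of the four counts above and consistent with the counters computed in the implementation that follows:

$$\mathrm{ac} = \frac{TP + TN}{TP + TN + FP + FN} \qquad p = \frac{TP}{TP + FP} \qquad r = \frac{TP}{TP + FN}$$

$$F_1 = \frac{2\,p\,r}{p + r} \qquad G = \sqrt{p\,r}$$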
Implementation
Let's implement the validation formulas using the same trait-based modular design used in creating the preprocessor and classifier modules. The Validation trait defines the signature for the validation of a classification model: the computation of the F1 statistic and the precision-recall pair:

```scala
trait Validation {
  def f1: Double
  def precisionRecall: (Double, Double)
}
```
Let's provide a default implementation of the Validation trait with the F1Validation class. In the tradition of Scala programming, the class is immutable; it computes the counters for TP, TN, FP, and FN when the class is instantiated. The class takes two parameters:
- actualExpected: The array of actual versus expected class indices
- tpClass: The target class for true positive observations
```scala
class F1Validation(actualExpected: Array[(Int, Int)], tpClass: Int) extends Validation {
  import Label._  // TP, TN, FP, FN

  // Tally TP, TN, FP, and FN over the (actual, expected) pairs at instantiation
  val counts = actualExpected.foldLeft(new Counter[Label])((cnt, oSeries) =>
    cnt + classify(oSeries._1, oSeries._2))

  lazy val accuracy = {
    val num = counts(TP) + counts(TN)
    num.toDouble/counts.foldLeft(0)((s, kv) => s + kv._2)
  }
  lazy val precision = counts(TP).toDouble/(counts(TP) + counts(FP))
  lazy val recall = counts(TP).toDouble/(counts(TP) + counts(FN))

  override def f1: Double = 2.0*precision*recall/(precision + recall)
  override def precisionRecall: (Double, Double) = (precision, recall)

  def classify(actual: Int, expected: Int): Label = {
    if (actual == expected) { if (actual == tpClass) TP else TN }
    else { if (actual == tpClass) FP else FN }
  }
}
```
The precision and recall variables are defined as lazy so that they are computed only once, when they are either accessed for the first time or when the f1 and precisionRecall functions are invoked. The class is independent of the selected machine learning algorithm, the training, the labeling process, and the type of observations.
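As a quick sanity check, the following hypothetical invocation uses a small, made-up array of (actual, expected) pairs, with class 1 as the positive class:

```scala
// Made-up labeled results: 2 TP, 1 TN, 1 FP, 1 FN
val validator = new F1Validation(
  Array((1, 1), (0, 0), (1, 0), (0, 1), (1, 1)), tpClass = 1)

println(validator.f1)              // 0.666... = 2*p*r/(p + r)
println(validator.precisionRecall) // (0.666..., 0.666...)
```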
Contrary to Java, which defines an enumeration as a specialized class, Scala requires enumerations to be singleton objects that inherit the functionality of the Enumeration class:

```scala
object Label extends Enumeration {
  type Label = Value
  val TP, TN, FP, FN = Value
}
```
K-fold cross-validation
It is quite common that the labeled dataset used for both training and validation is not large enough. The solution is to break the original labeled dataset into K groups of data. The data scientist creates K training-validation dataset pairs by selecting one of the groups as the validation set and combining all the remaining groups into a training set, as illustrated in the next diagram. The process is known as K-fold cross-validation [2:7].

In the diagram, the third segment, S3, is used as validation data while all the other segments are combined into a single training set. This process is applied in turn to each segment of the original labeled dataset.
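The generation of the folds themselves is straightforward. The following is a minimal sketch, assuming the labeled observations fit in memory; the kFold helper is illustrative and is not part of the library built in this book:

```scala
// Split labeled data into K segments; fold i uses segment i for
// validation and the remaining K-1 segments for training.
def kFold[T](labeled: Vector[T], k: Int): Seq[(Vector[T], Vector[T])] = {
  val segmentSize = math.ceil(labeled.size.toDouble/k).toInt
  val segments = labeled.grouped(segmentSize).toVector
  segments.indices.map(i =>
    (segments.patch(i, Nil, 1).flatten, segments(i)) // (training, validation)
  )
}
```

Each of the K training-validation pairs is used to train and validate the model, and the quality metrics are then averaged across the folds.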
Bias-variance decomposition
There is an obvious challenge in creating a model that fits both the training set and subsequent observations to be classified during the validation phase.
If the model fits the observations selected for training too tightly, there is a high probability that new observations will not be correctly classified. This is usually the case when the model is complex. Such a model is characterized as having a low bias with a high variance. This scenario can be attributed to the scientist being overly confident that the observations selected for training are representative of the real world.
Conversely, as the selected model fits the training set more loosely, the probability of misclassifying a new observation increases. In this case, the model is characterized as having a high bias with a low variance.
The bias, variance, and mean squared error (MSE) of an estimator are defined by the following formulas:

Note

Variance and bias of an estimator $\hat{\theta}$ of the true model $\theta$:

$$\mathrm{var}(\hat{\theta}) = E\big[(\hat{\theta} - E[\hat{\theta}])^2\big] \qquad \mathrm{bias}(\hat{\theta}) = E[\hat{\theta}] - \theta$$

Mean squared error:

$$\mathrm{MSE} = E\big[(\hat{\theta} - \theta)^2\big] = \mathrm{var}(\hat{\theta}) + \mathrm{bias}(\hat{\theta})^2$$
Let's illustrate the concepts of bias, variance, and mean squared error with an example. At this stage, most of the machine learning techniques have not been introduced yet. Therefore, the example emulates multiple models fEst: Double => Double generated from non-overlapping training sets. These models are evaluated against a test/validation dataset that is itself emulated by a model, emul. The BiasVarianceEmulator class takes the emulator function and the size nValues of the validation test as parameters. It merely implements the formula to compute the bias and variance for each of the fEst models:
```scala
class BiasVarianceEmulator[T <% Double](emul: Double => Double, nValues: Int) {

  def fit(fEst: List[Double => Double]): Option[XYTSeries] = {
    val rf = Range(0, fEst.size)
    // Mean of the model estimates for each validation value
    val meanFEst = Array.tabulate(nValues)(x =>
      rf.foldLeft(0.0)((s, n) => s + fEst(n)(x))/fEst.size) // 1

    val r = Range(0, nValues)
    Some(fEst.map(fe => {
      r.foldLeft((0.0, 0.0))((s, x) => {
        val diff = (fe(x) - meanFEst(x))/fEst.size // 2
        (s._1 + diff*diff, s._2 + Math.abs(fe(x) - emul(x)))
      })
    }).toArray)
  }
}
```
The fit method computes the variance and bias for each of the fEst models generated from training. First, the mean of all the model estimates is computed (line 1), and then used in the computation of the variance and bias (line 2). The method returns a tuple (variance, bias) for each of the fEst models.
Let's apply the emulator to three nonlinear regression models evaluated against the validation data:

$$f_1(x) = \frac{x}{5} \qquad f_2(x) = 3\cdot10^{-4}x^2 + 0.18x \qquad f_3(x) = \frac{x\,(1 + \sin(x/20))}{5}$$
The client code for the emulator consists of defining the emul emulator function and a list, fEst, of three models defined as tuples of (function, descriptor) of type (Double=>Double, String). The fit method is called on the model functions extracted through a map, as shown in the following code:
```scala
val emul = (x: Double) => 0.2*x*(1.0 + Math.sin(x*0.05))

val fEst = List[(Double=>Double, String)](
  ((x: Double) => 0.2*x, "y=x/5"),
  ((x: Double) => 0.0003*x*x + 0.18*x, "y=3e-4.x^2+0.18x"),
  ((x: Double) => 0.2*x*(1 + Math.sin(x*0.05)), "y=x(1+sin(x/20))/5")
)

val emulator = new BiasVarianceEmulator[Double](emul, 200)
emulator.fit(fEst.map(_._1)) match {
  case Some(varBias) => show(varBias) // display with JFreeChart
  case None => …
}
```
The JFreeChart library is used to display the test dataset and the three model functions.

Fitting models to dataset
The variance-bias trade-off is illustrated in the following scatter chart using the absolute value of the bias:

The more complex the function, the lower its bias. Lower bias is usually, but not always, associated with higher variance. The most complex function, y=x(1+sin(x/20))/5, which is in fact identical to the emulator function, has by far the highest variance and the lowest bias. The most complex model fits the training dataset fairly well. As expected, the mean squared error reflects the ability of each of the three models to fit the test data.

Mean square error bar chart
The low bias of the complex model is reflected in its ability to predict new observations correctly. Its MSE is therefore low, as expected.
Complex models with low bias and high variance are said to overfit the training data. Models with high bias and low variance are said to underfit it.
Overfitting
The methodology presented in this example can be applied to any classification or regression model. Models with low variance include constant functions and models that are independent of the training set. High-degree polynomials, complex functions, and deep neural networks have high variance. Linear regression applied to linear data has a low bias, while linear regression applied to nonlinear data has a higher bias [2:8].
Overfitting affects all aspects of the modeling process negatively, for example:
- It is a sure sign of an overly complex model, which is difficult to debug and consumes computational resources
- It causes the model to represent minor fluctuations and noise
- It may discover irrelevant relationships between observed and latent features
- It results in poor predictive performance
However, there are well-proven solutions to reduce overfitting [2:9]:
- Increasing the size of the training set whenever possible
- Reducing noise in labeled and input data through filtering
- Decreasing the number of features using techniques such as principal components analysis
- Modeling observable and latent noise using filtering techniques such as Kalman filters or autoregressive models
- Reducing inductive bias in a training set by applying cross-validation
- Penalizing extreme values for some of the model's features using regularization techniques
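As a brief illustration of the last point, L2 (ridge) regularization adds the squared magnitude of the model's weights, scaled by a factor lambda, to the training loss. The following sketch uses made-up names and is not tied to any specific model in this book:

```scala
// Hypothetical L2-regularized loss: mean squared error + lambda*||w||^2
def ridgeLoss(predicted: Array[Double], expected: Array[Double],
              weights: Array[Double], lambda: Double): Double = {
  val mse = predicted.zip(expected)
                     .map { case (p, e) => (p - e)*(p - e) }
                     .sum / predicted.length
  mse + lambda*weights.map(w => w*w).sum
}
```

Larger values of lambda shrink the weights toward zero, trading a small increase in bias for a reduction in variance.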