Machine Learning with Scala Quick Start Guide

General machine learning rule of thumb

The general rule of thumb in machine learning is that the more data there is, the better the predictive model. Having more features, however, often hurts rather than helps: performance can degrade drastically, especially when the dataset is high-dimensional. The learning process requires input datasets that can be split into three types (or that are already provided as such):

  • A training set is the knowledge base, coming from historical or live data, that is used to fit the parameters of the ML algorithm. During the training phase, the ML model uses the training set to find the optimal weights (of a neural network, for example) by minimizing the training error as measured by the objective function. Here, backpropagation or another optimization algorithm is used to train the model, but all the hyperparameters must be set before the learning process starts.
  • A validation set is a set of examples used to tune the hyperparameters of an ML model. It helps confirm that the model is trained well and generalizes rather than overfitting the training set. Some ML practitioners also refer to it as a development set or dev set.
  • A test set is used for evaluating the performance of the trained model on unseen data. This step is also referred to as model inference. Once the final model has been assessed on the test set and we are fully satisfied with its performance, we do not tune it any further; the trained model can be deployed in a production-ready environment.
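
To make the three roles concrete, here is a minimal sketch in plain Scala rather than a definitive implementation. It assumes a toy one-feature ridge regression; the names `ValidationSketch`, `fit`, `mse`, and `selectLambda` are hypothetical and exist only for illustration. The weight is fitted on the training set, the hyperparameter `lambda` is chosen on the validation set, and the test set is used once at the end.

```scala
object ValidationSketch {
  // Illustrative data type: (feature x, label y).
  type Point = (Double, Double)

  // Fit y ≈ w * x on the training set with ridge penalty lambda.
  // For a single weight the closed-form solution is sum(xy) / (sum(x^2) + lambda).
  def fit(train: Seq[Point], lambda: Double): Double = {
    val sxy = train.map { case (x, y) => x * y }.sum
    val sxx = train.map { case (x, _) => x * x }.sum
    sxy / (sxx + lambda)
  }

  // Mean squared error of weight w on any dataset.
  def mse(w: Double, data: Seq[Point]): Double =
    data.map { case (x, y) => math.pow(y - w * x, 2) }.sum / data.size

  // Hyperparameter tuning: refit on the training set for each candidate
  // lambda and keep the one with the lowest validation error.
  def selectLambda(train: Seq[Point],
                   validation: Seq[Point],
                   candidates: Seq[Double]): Double =
    candidates.minBy(lambda => mse(fit(train, lambda), validation))
}
```

A possible usage, under the same assumptions: call `selectLambda(train, validation, Seq(0.0, 1.0, 10.0))`, refit with the chosen value, and only then report `mse(w, test)` as the estimate of performance on unseen data.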

A common practice is splitting the input data (after the necessary preprocessing and feature engineering) into proportions that sum to 100%, such as 60% for training, 20% for validation, and 20% for testing, but the right split really depends on the use case. Sometimes, we also need to up-sample or down-sample the data, depending on the availability and quality of the datasets.
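
A split of this kind can be sketched in a few lines of plain Scala. This is an illustrative sketch, not a prescribed method: the object `SplitSketch`, the helper `split`, and its parameters are hypothetical names, and the fractions passed in are whatever the use case calls for.

```scala
import scala.util.Random

object SplitSketch {
  // Hypothetical helper: shuffles the rows with a fixed seed, then cuts them
  // into training, validation, and test sets. Whatever is left after the
  // training and validation fractions becomes the test set.
  def split[A](data: Seq[A],
               trainFrac: Double,
               valFrac: Double,
               seed: Long): (Seq[A], Seq[A], Seq[A]) = {
    require(trainFrac > 0 && valFrac >= 0 && trainFrac + valFrac < 1.0,
      "fractions must leave room for a test set")
    // Shuffle first so that any ordering in the source data does not leak
    // into one particular subset.
    val shuffled = new Random(seed).shuffle(data)
    val nTrain = math.round(data.size * trainFrac).toInt
    val nVal = math.round(data.size * valFrac).toInt
    val (train, rest) = shuffled.splitAt(nTrain)
    val (validation, test) = rest.splitAt(nVal)
    (train, validation, test)
  }
}
```

For example, `SplitSketch.split(rows, 0.6, 0.2, seed = 42L)` leaves the remaining rows for testing. On Spark, `Dataset.randomSplit(weights, seed)` serves a similar purpose, although the resulting sizes are only approximately proportional to the weights.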

This rule of thumb for learning from different types of datasets can differ across machine learning tasks, as we will cover in the next section. Before that, however, let's take a quick look at a few common phenomena in machine learning.