上QQ阅读APP看书，第一时间看更新

Supervised learning

Supervised learning is the simplest and most well-known automatic learning task. It is based on a number of pre-defined examples, in which the category to which each of the inputs should belong is already known. Figure 2 shows a typical workflow of supervised learning.

An actor (for example, an ML practitioner, data scientist, data engineer, ML engineer, and so on) performs Extraction Transformation Load (ETL) and the necessary feature engineering (including feature extraction, selection, and so on) to get the appropriate data having features and labels. Then he does the following:

Splits the data into training, development, and test sets
Uses the training set to train an ML model

The validation set is used to validate the training against the overfitting problem and regularization
He then evaluates the model's performance on the test set (that is unseen data)
If the performance is not satisfactory, he can perform additional tuning to get the best model based on hyperparameter optimization
Finally, he deploys the best model in a production-ready environment

Supervised learning in action

In the overall life cycle, there might be many actors involved (for example, a data engineer, data scientist, or ML engineer) to perform each step independently or collaboratively.

The supervised learning context includes classification and regression tasks; classification is used to predict which class a data point is part of (discrete value), while regression is used to predict continuous values. In other words, a classification task is used to predict the label of the class attribute, while a regression task is used to make a numeric prediction of the class attribute.

In the context of supervised learning, unbalanced data refers to classification problems where we have unequal instances for different classes. For example, if we have a classification task for only two classes, balanced data would mean 50% pre-classified examples for each of the classes.

If the input dataset is a little unbalanced (for example, 60% data points for one class and 40% for the other class), the learning process will require for the input dataset to be split randomly into three sets, with 50% for the training set, 20% for the validation set, and the remaining 30% for the testing set.