The Machine Learning Workshop

Supervised and Unsupervised Learning

ML is divided into two main categories: supervised and unsupervised learning.

Supervised Learning

Supervised learning consists of understanding the relationship between a given set of features and a target value, also known as a label or class. For instance, it can be used for modeling the relationship between a person's demographic information and their ability to pay loans, as shown in the following table:

Figure 1.18: The relationship between a person's demographic information and the ability to pay loans

Models trained to capture these relationships can then be used to predict labels for new data. As we can see from the preceding example, a bank that builds such a model can then input data from loan applicants to determine whether they are likely to pay back the loan.

These models can be further divided into classification and regression tasks, which are explained as follows.

Classification tasks are used to build models from data with discrete categories as labels; for instance, a classification task can be used to predict whether a person will pay back a loan. There can be more than two discrete categories, such as predicting a horse's ranking in a race, but the number of categories must be finite.

Most classification tasks output the prediction as the probability of an instance belonging to each output label. The assigned label is the one with the highest probability, as can be seen in the following diagram:

Figure 1.19: An illustration of a classification algorithm

Some of the most common classification algorithms are as follows:

  • Decision trees: This algorithm follows a tree-like structure that simulates a decision process as a series of splits, considering one variable at a time (a code sketch follows this list).
  • Naïve Bayes classifier: This algorithm relies on a group of probabilistic equations derived from Bayes' theorem, under the assumption that the features are independent of one another. It has the ability to consider several attributes.
  • Artificial neural networks (ANNs): These replicate the structure and performance of a biological neural network to perform pattern recognition tasks. An ANN consists of interconnected neurons, laid out with a set architecture. They pass information to one another until a result is achieved.
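
To make this concrete, the following is a minimal sketch of a classification workflow, assuming the scikit-learn library is available; the applicant features and labels are made up purely for illustration. It also shows the per-label probabilities mentioned earlier:

    # A minimal classification sketch using scikit-learn with made-up loan data
    from sklearn.tree import DecisionTreeClassifier

    # Hypothetical features: [age, annual income in thousands]
    X_train = [[25, 40], [47, 95], [35, 60], [52, 110], [23, 28], [40, 75]]
    # Hypothetical labels: 1 = paid the loan back, 0 = defaulted
    y_train = [0, 1, 1, 1, 0, 1]

    model = DecisionTreeClassifier(random_state=0)
    model.fit(X_train, y_train)

    new_applicant = [[30, 50]]  # a new, unseen applicant
    print(model.predict(new_applicant))        # predicted class, for example [1]
    print(model.predict_proba(new_applicant))  # probability for each class label

The same fit/predict interface applies to the other classifiers listed above, which makes it easy to swap one algorithm for another when experimenting.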

Regression tasks, on the other hand, are used for data with continuous quantities as labels; for example, a regression task can be used for predicting house prices. This means that the value is represented by a quantity and not by a set of possible outputs. Output labels can be of integer or float types:

  • The most popular algorithm for regression tasks is linear regression. In its simplest form, it models the relationship between a single independent feature (x) and a dependent target (y) as a straight line. Due to its simplicity, it is often overlooked, even though it performs very well for simple data problems (see the sketch after this list).
  • Other, more complex, regression algorithms include regression trees and support vector regression, as well as ANNs once again.
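
As a brief illustration of a regression task, the following sketch fits a linear regression model, again assuming scikit-learn and using invented house-size and price figures; note that the prediction is a continuous quantity rather than a category:

    # A minimal regression sketch using scikit-learn with made-up data
    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Hypothetical feature: house size in square meters
    X_train = np.array([[50], [80], [100], [120], [150]])
    # Hypothetical continuous target: price in thousands
    y_train = np.array([150, 240, 295, 360, 450])

    model = LinearRegression()
    model.fit(X_train, y_train)

    print(model.predict(np.array([[90]])))  # predicted price for a 90 m² house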

Unsupervised Learning

Unsupervised learning consists of fitting a model to data without reference to any output label; such data is also known as unlabeled data. This means that algorithms in this category try to understand the data and find patterns in it. For instance, unsupervised learning can be used to understand the profiles of people belonging to a neighborhood, as shown in the following diagram:

Figure 1.20: An illustration of how unsupervised algorithms can be used to understand the profiles of people

When applying a predictor to these algorithms, no target label is given as output. The prediction, which is only available for some models, consists of placing the new instance into one of the subgroups of data that have been created. Unsupervised learning is further divided into different tasks, but the most popular one is clustering, which will be discussed next.

Clustering tasks involve creating groups of data (clusters) that comply with the condition that instances in one group differ visibly from the instances in the other groups. The output of any clustering algorithm is a label that assigns each instance to the cluster with that label:

Figure 1.21: A diagram representing clusters of multiple sizes

The preceding diagram shows a group of clusters, each of a different size based on the number of instances that belong to it. Even though clusters do not need to contain the same number of instances, it is possible to set a minimum number of instances per cluster to avoid overfitting the model to tiny clusters of very specific data points.

Some of the most popular clustering algorithms are as follows:

  • k-means: This focuses on separating the instances into n clusters of equal variance by minimizing the sum of squared distances between each instance and the centroid of its cluster (a sketch follows this list).
  • Mean-shift clustering: This creates clusters by using centroids. Candidate centroids are iteratively updated to be the mean of the points within their region, shifting them toward denser areas of the data.
  • Density-Based Spatial Clustering of Applications with Noise (DBSCAN): This determines clusters as areas with a high density of points, separated by areas with low density.
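
As mentioned in the k-means item above, here is a minimal clustering sketch, assuming scikit-learn and using made-up, unlabeled data. Note that no target labels are given during training; the output is simply a cluster assignment for each instance:

    # A minimal k-means sketch using scikit-learn with made-up 2D data
    import numpy as np
    from sklearn.cluster import KMeans

    # Hypothetical unlabeled data: [age, spending score] for six people
    X = np.array([[22, 80], [25, 75], [47, 20], [52, 25], [33, 50], [30, 55]])

    model = KMeans(n_clusters=3, n_init=10, random_state=0)
    model.fit(X)

    print(model.labels_)                        # cluster assigned to each instance
    print(model.predict(np.array([[28, 70]])))  # cluster for a new instance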