Machine Learning Algorithms (Second Edition)

Maximum likelihood learning

We have defined likelihood as a filtering term in the Bayes formula. In general, it has the form of the following:
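In standard notation (the symbols h for a hypothesis and X for the dataset are assumed here), the maximum likelihood criterion being described is:

```latex
\hat{h}_{ML} = \underset{h}{\operatorname{argmax}} \; P(X \mid h)
```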

Here, the first term expresses the actual likelihood of a hypothesis, given a dataset X. As you can imagine, in this formula there are no longer any a priori probabilities, so maximizing it doesn't imply accepting a theoretically preferred hypothesis, nor penalizing unlikely ones. A very common approach, known as Expectation Maximization (EM), which is used in many algorithms (we're going to see an example in logistic regression), is split into two main parts:

  • Determining a log-likelihood expression based on the model parameters (they will be optimized accordingly). This is normally a proxy that depends on the set of parameters θt computed at time (iteration) t.

  • Maximizing it until the residual error is small enough. The maximization is performed with respect to θt; hence, at the next iteration, the proxy is recomputed with the new set of parameters.

This is a special case of a famous general algorithm of the same name (EM). A complete explanation is complex and beyond the scope of this book (it can be found in Mastering Machine Learning Algorithms, Bonaccorso G., Packt Publishing, 2018); however, it's easy to grasp the main concepts.

Working with the log-likelihood (the likelihood itself is normally called L) is a useful trick that can simplify gradient calculations. A generic likelihood expression is as follows:
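Assuming n i.i.d. samples x1, …, xn and a hypothesis h (notation assumed for this sketch), the likelihood is the product of the single-sample probabilities:

```latex
L(h \mid X) = P(X \mid h) = \prod_{i=1}^{n} p(x_i \mid h)
```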

As all the parameters are summarized inside the terms of the product, the gradient is a complex expression that isn't very manageable. However, our goal is to maximize the likelihood, and it's equivalent (and easier) to minimize its reciprocal:
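With the same assumed notation, minimizing the reciprocal means solving:

```latex
\hat{h}_{ML} = \underset{h}{\operatorname{argmin}} \; \frac{1}{\prod_{i=1}^{n} p(x_i \mid h)}
```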

This can be turned into a very simple expression by applying a natural logarithm (which is monotonic):
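Applying the logarithm to the reciprocal turns the product into a sum, yielding the negative log-likelihood (same assumed notation):

```latex
\log \frac{1}{\prod_{i=1}^{n} p(x_i \mid h)} = -\sum_{i=1}^{n} \log p(x_i \mid h)
```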

The last term is a summation that can be easily differentiated and used in most optimization algorithms. Contrary to direct algorithms, EM works in an iterative fashion, alternating a step where the likelihood is computed using the current parameter estimates and another one where these parameters are chosen so as to maximize the expected log-likelihood. At the end of this process, we can find a set of parameters that provides the maximum likelihood (ML) without any strong statement about prior distributions. This approach can seem very technical, but its logic is really simple and intuitive. To understand how it works, I propose a simple exercise, based on the Gaussian mixture technique, which is also discussed in Mastering Machine Learning Algorithms, Bonaccorso G., Packt Publishing, 2018.
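The alternation just described can be sketched with a tiny two-component, one-dimensional Gaussian mixture (the dataset, initial values, and number of iterations below are illustrative assumptions, not taken from the text):

```python
import numpy as np

np.random.seed(0)

# Illustrative dataset: two well-separated Gaussian clusters
X = np.concatenate([np.random.normal(-2.0, 1.0, size=150),
                    np.random.normal(3.0, 1.0, size=150)])

def gaussian_pdf(x, mu, var):
    return np.exp(-np.square(x - mu) / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

# Initial parameter guesses (theta_0)
weights = np.array([0.5, 0.5])
means = np.array([-1.0, 1.0])
variances = np.array([1.0, 1.0])

for _ in range(50):
    # E-step: responsibilities (expected assignments) under the current parameters
    resp = weights * np.stack([gaussian_pdf(X, m, v)
                               for m, v in zip(means, variances)], axis=1)
    resp /= resp.sum(axis=1, keepdims=True)

    # M-step: parameters that maximize the expected log-likelihood
    nk = resp.sum(axis=0)
    weights = nk / len(X)
    means = (resp * X[:, None]).sum(axis=0) / nk
    variances = (resp * np.square(X[:, None] - means)).sum(axis=0) / nk

print(weights, means, variances)
```

The E-step computes the expected cluster assignments under the current parameters; the M-step re-estimates the weights, means, and variances so as to maximize the expected log-likelihood, and the two steps are repeated until convergence.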

Let's consider 100 points drawn from a Gaussian distribution with zero mean and a variance equal to 2.0 (that is, a standard deviation of √2; the samples are independent):

import numpy as np

np.random.seed(1000)

nb_samples = 100
X_data = np.random.normal(loc=0.0, scale=np.sqrt(2.0), size=nb_samples)

The plot is shown in the following graph:

Points sampled from a Gaussian distribution

In this case, there's no need for a deep exploration (we know how the points were generated); however, after restricting the hypothesis space to the Gaussian family (the most suitable choice considering only the graph), we'd like to find the best values for the mean and variance. First of all, we need to compute the log-likelihood (which is rather simple thanks to the exponential function):
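With the dataset X = {x1, …, xn}, mean μ, and variance σ² (notation assumed here), the Gaussian log-likelihood can be written as:

```latex
\log L(\mu, \sigma^2 \mid X) = \sum_{i=1}^{n} \log \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(x_i - \mu)^2}{2\sigma^2}}
= -\frac{n}{2}\log\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2
```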

A simple Python implementation is provided next (for ease of use, there's a single parameter array v, where v[0] is the mean and v[1] is the variance):

def negative_log_likelihood(v):
    # v[0] is the mean, v[1] is the variance
    l = 0.0
    f1 = 1.0 / np.sqrt(2.0 * np.pi * v[1])
    f2 = 2.0 * v[1]

    for x in X_data:
        l += np.log(f1 * np.exp(-np.square(x - v[0]) / f2))

    return -l

Then, we need to find its minimum (in terms of the mean and variance) with any of the available methods (gradient descent or another numerical optimization algorithm). For example, using SciPy's minimize function, we can easily get the following:

from scipy.optimize import minimize

minimize(fun=negative_log_likelihood, x0=np.array([0.0, 1.0]))

fun: 181.6495875832933
hess_inv: array([[ 0.01959881, -0.00035322],
[-0.00035322, 0.09799955]])
jac: array([ 1.90734863e-06, 0.00000000e+00])
message: 'Optimization terminated successfully.'
nfev: 40
nit: 8
njev: 10
status: 0
success: True
x: array([ 0.08052498, 2.21469485])
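For a single Gaussian, the ML solution also has a closed form: the sample mean and the (biased) sample variance. As a sanity check (a sketch regenerating the same dataset with the seed used above), these analytic estimates should match the numerical result very closely:

```python
import numpy as np

np.random.seed(1000)

nb_samples = 100
X_data = np.random.normal(loc=0.0, scale=np.sqrt(2.0), size=nb_samples)

# Closed-form ML estimates for a Gaussian: sample mean and biased sample variance
mu_ml = X_data.mean()
var_ml = np.mean(np.square(X_data - mu_ml))

print(mu_ml, var_ml)
```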

A graph of the negative_log_likelihood function is plotted next. The global minimum of this function corresponds to the optimal likelihood given a certain distribution. This doesn't mean that the problem has been completely solved, because the first step of this algorithm is determining an expectation, which must always be realistic. The likelihood function, however, is quite sensitive to wrong distributions, because it can easily get close to zero when the probabilities are low. For this reason, ML learning is often preferable to MAP learning, which needs a priori distributions and can fail when they are not selected in the most appropriate way:

Trajectories from two different starting points and the MLE target

This approach has been applied to a specific distribution family (which is indeed very easy to manage), but it also works perfectly when the model is more complex. Of course, it's always necessary to have an initial awareness of how the likelihood should be determined, because more than one feasible family can generate the same dataset. In all of these cases, Occam's razor is the best way to proceed: the simplest hypothesis should be considered first. If it doesn't fit, an extra level of complexity can be added to our model. As we'll see, in many situations, the easiest solution is the winning one, and increasing the number of parameters or using a more detailed model can only add noise and a higher risk of overfitting.

SciPy (https://www.scipy.org) is a set of high-end scientific and data-oriented libraries available for Python, part of an ecosystem that includes NumPy, pandas, and many other useful frameworks. If you want to read more about Python scientific computing, refer to Learning SciPy for Numerical and Scientific Computing, Second Edition, Rojas S. J. R. G., Christensen E. A., Blanco-Silva F. J., Packt Publishing, 2017.