Mastering Machine Learning on AWS
上QQ阅读APP看书,第一时间看更新

Understanding linear regression

Regression algorithms are an important algorithm in a data scientist's toolkit as they can be used for various non-binary prediction tasks. The linear regression algorithm models the relationship between a dependent variable that we are trying to predict with a vector of independent variables. The vector of variables is also called the regressor in the context of regression algorithms. Linear regression assumes that there is a linear relationship between the vector of independent variables and the dependent variable that we are trying to predict. Hence, linear regression models learn the unknown variables and constants of a linear function using the training data so that the linear function best fits the training data. 

Linear regression can be applied in cases where the goal is to predict or forecast the dependent variable based on the regressor variables. We will use an example to explain how linear regression trains based on data.

The following table shows a sample dataset where the goal is to predict the price of a house based on three variables:

In this dataset, the Floor Size, Number of Bedrooms, and Number of Bathrooms variables are assumed as independent in linear regression. Our goal is to predict the House Price value based on the variables. 

Let's simplify this problem. Let's only consider the Floor Size variable to predict the house price. Creating linear regression from only one variable or regressor is referred to as a simple linear regression. If we create a scatterplot of these two values, we can see that there is a relationship between them:

Although there is not an exact linear relationship between the two variables, we can create an approximate line that represents the trend. The aim of the modeling algorithm is to minimize the error in creating this approximate line. 

As we know, a straight line can be represented by the following equation:

                              

Hence, the approximately linear relationship in the preceding diagram can also be represented using the same formula, and the task of the linear regression model is to learn the values of  and . Moreover, since we know that the relationship between the predicted variable and the regressors is not strictly linear, we can add a random error variable to the equation that models the noise in the dataset. The following formula represents how the simple linear regression model is represented:

                     

Now, let's consider a dataset with multiple regressors. Instead of just representing the linear relationship between two variables,  and , we will represent a set of regressors as . We will assume that a linear relationship between the dependent variable  and the regressors  is linear.  Thus, a linear regression model with multiple regressors is represented by the following formula:

       

Linear regression, as we've already discussed, assumes that there is a linear relationship between the regressors and the dependent variable that we are trying to predict. This is an important assumption that may not hold true in all datasets. Hence, for a data scientist, using linear regression may look attractive due to its fast training time. However, if the dataset variables do not have a linear relationship with the dependent variable, it may lead to significant errors. In such cases, data scientists may also try algorithms such as Bernoulli regression, Poisson regression, or multinomial regression to improve prediction precision. We will also discuss logistic regression later in this chapter, which is used when the dependent variable is binary. 

During the training phase, linear regression can use various techniques for parameter estimation to learn the values of , and . We will not go into the details of these techniques in this book. However, we recommend that you try using these parameter estimation techniques in the examples that follow and observe their effect on the training time of the algorithm and the accuracy of prediction. 

To fit a linear model to the data, we first need to be able to determine how well a linear model fits the data. There are various models being developed for parameter estimation in linear regression. Parameter estimation is the process of estimating the values of and . In the following sections, will briefly explain the following estimation techniques.