Exploratory data analysis
Exploratory data analysis is part and parcel of any model-building process. Understanding the algorithm at play is important too. Given that this chapter revolves around linear regression, it is worth exploring the data with linear regression in mind.
But first, let's look at the data. One of the first things I recommend to any budding data scientist keen on machine learning is to explore the data, or a subset of it, to get a feel for it. I usually do this in a spreadsheet application such as Excel or Google Sheets. I then try to understand, in human terms, what the data means.
This dataset comes with a description of fields, which I can't enumerate in full here. A snapshot, however, would be illuminating for the rest of the discussion in this chapter:
- SalePrice: The property's sale price in dollars. This is the dependent variable that we're trying to predict.
- MSSubClass: The building class.
- MSZoning: The general zoning classification.
- LotFrontage: The linear feet of the street connected to the property.
- LotArea: The lot size in square feet.
There are multiple ways of understanding linear regression. One of my favorites ties directly into exploratory data analysis: looking at linear regression through the lens of the conditional expectation function (CEF) of the dependent variable, given the independent variables.
The conditional expectation function of a variable is simply the expected value of that variable, given the value of another variable. This seems like a rather dense subject to get through, so I shall offer three different views of the same topic in an attempt to clarify:
- Statistical point of view: The conditional expectation function of a dependent variable Y given a vector of covariates X is simply the expected value of Y (the average) when X is fixed to x, written E[Y | X = x].
- Programming point of view in pseudo-SQL: select avg(Y) from dataset where X = 'Xi'. When conditioning on multiple variables, it's simply this: select avg(Y) from dataset where X1 = 'Xik' and X2 = 'Xjl'.
- Concrete example: What is the expected house price if one of the independent variables (say, MSZoning) is RL? The expected house price is the population average, which translates to: of all the houses in Boston, what is the average sale price of those whose zoning type is RL?
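The pseudo-SQL view translates fairly directly into Go. The following is a minimal sketch, not the implementation used later in the chapter: the row type and the conditionalMean function are hypothetical names introduced for illustration, and the prices in main are made-up toy values rather than rows from the dataset. It computes the sample average, which stands in for the population average described above.

```go
package main

import "fmt"

// row is a hypothetical, pared-down record holding only the two
// fields needed for this illustration.
type row struct {
	MSZoning  string
	SalePrice float64
}

// conditionalMean returns the average SalePrice over the rows whose
// MSZoning equals zone, which is the sample analogue of
// E[SalePrice | MSZoning = zone]. ok is false when no row matches.
func conditionalMean(rows []row, zone string) (mean float64, ok bool) {
	var sum float64
	var n int
	for _, r := range rows {
		if r.MSZoning == zone {
			sum += r.SalePrice
			n++
		}
	}
	if n == 0 {
		return 0, false
	}
	return sum / float64(n), true
}

func main() {
	// Made-up toy values for illustration only; not taken from the dataset.
	rows := []row{
		{"RL", 200000},
		{"RM", 120000},
		{"RL", 180000},
	}
	// Equivalent to: select avg(SalePrice) from dataset where MSZoning = 'RL'
	if m, ok := conditionalMean(rows, "RL"); ok {
		fmt.Printf("avg SalePrice where MSZoning = RL: %.0f\n", m)
	}
}
```

Note that this scans every row on each query, which is exactly the cost that an index would avoid.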
As it stands, this is a pretty bastardized version of what the CEF is. There are some subtleties involved in the definition of the CEF, but they are outside the scope of this book, so we shall set them aside. For now, this rough understanding of the CEF is enough to get us started with our exploratory data analysis.
The programming point of view in pseudo-SQL is useful because it tells us what we need in order to compute aggregates of the data quickly: indices. Because our dataset is small, we can be relatively blasé about the data structures used to index the data.
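To make the idea of an index concrete, here is one possible sketch under the assumption that a plain Go map is good enough for a dataset of this size. The names valueIndex, buildValueIndex, and meanOver are hypothetical and introduced only for this example, and the values in main are made-up toy data. The index maps each distinct value of a column to the positions of the rows that hold it, so a where clause becomes a single map lookup.

```go
package main

import "fmt"

// valueIndex maps each distinct value of a categorical column to the
// positions of the rows containing that value. Using a plain map is an
// assumption that suits a small dataset.
type valueIndex map[string][]int

// buildValueIndex scans the column once and records, for every value,
// which row positions contain it.
func buildValueIndex(column []string) valueIndex {
	idx := make(valueIndex)
	for i, v := range column {
		idx[v] = append(idx[v], i)
	}
	return idx
}

// meanOver averages ys at the given row positions.
// ok is false when positions is empty.
func meanOver(ys []float64, positions []int) (mean float64, ok bool) {
	if len(positions) == 0 {
		return 0, false
	}
	var sum float64
	for _, i := range positions {
		sum += ys[i]
	}
	return sum / float64(len(positions)), true
}

func main() {
	// Made-up toy columns for illustration only; not taken from the dataset.
	zoning := []string{"RL", "RM", "RL", "FV"}
	prices := []float64{200000, 120000, 180000, 250000}

	idx := buildValueIndex(zoning)
	// Equivalent to: select avg(SalePrice) from dataset where MSZoning = 'RL'
	if m, ok := meanOver(prices, idx["RL"]); ok {
		fmt.Printf("avg SalePrice where MSZoning = RL: %.0f\n", m)
	}
}
```

Building the index costs one pass over the column, after which each conditional average touches only the matching rows. Looking up a value that never occurs yields a nil slice, which meanOver reports through ok rather than dividing by zero.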