
Intro

There are excellent tutorials on linear regression and a suite of other statistical and machine learning techniques for Python programmers. Here, I build a least squares linear regression classifier "from scratch", that is, without using any sophisticated libraries. This is not the right way to do things in practice: `sklearn` is well-tested, fast, and provides all the tools shown here. Doing the mathematics yourself is a great way to learn, though, and so this post is the first in a series on various machine learning topics that attempts to reveal how simple and accessible the underlying mathematics is.

All the code for this project may be found here.

Data

We consider the Cleveland Heart Disease data set. It contains observations of ~300 patients, including a target variable, num, which denotes the presence of heart disease. In the data set num is a categorical variable with several values, but we will only attempt to identify the presence or absence of disease, indicated by >50% narrowing of a blood vessel.

The data include several real valued observations as well as a few categorical features. We will get to handling categorical data in a regression task below.

Preprocessing

The Cleveland data set has only a few missing values. These are identified and removed. For some tasks it may be appropriate to replace missing values with, e.g., the mean.
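A minimal sketch of this step, assuming the raw UCI file processed.cleveland.data, which marks missing entries with '?' (the column names below follow the data set's documentation):

```python
import pandas as pd

# Column names as documented for the Cleveland data set.
cols = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
        "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num"]

# '?' marks missing values in the raw file; read them as NaN and drop those rows.
df = pd.read_csv("processed.cleveland.data", names=cols, na_values="?")
df = df.dropna()
```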

The next step is normalizing the real- or integer-valued columns (i.e., those columns whose measurements lie on a scale where distances are meaningful). I have seen it recommended to scale using the maximum and minimum values, but this ignores the specifics of the distribution; indeed the particular max/min might be wild outliers. Instead, take the z-score of each column.
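A sketch of the z-score step, continuing from the DataFrame above (numeric_cols is my own name for the real- and integer-valued columns; adjust the list to taste):

```python
# The columns measured on a meaningful numeric scale.
numeric_cols = ["age", "trestbps", "chol", "thalach", "oldpeak"]

# z-score: subtract each column's mean and divide by its standard deviation.
df[numeric_cols] = (df[numeric_cols] - df[numeric_cols].mean()) / df[numeric_cols].std()
```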

Best practice dictates checking these columns for (rough) normality. Many statistical techniques assume the data are approximately normal (the z-score guarantees zero mean and unit variance, but says nothing about the shape of the distribution), so you should at least look at a histogram to see whether some transformation is appropriate. I do not do this. It's fine. There's nothing outrageous or obviously lognormal, so we proceed.

On to categorical features. A nice tutorial uses a set of scikit-learn classifiers on the same data set. Its preprocessing step does not treat categorical variables any differently from real-valued measurements. For instance, cp denotes the chest pain type: it takes values 1, 2, 3, 4, where 1, 2, 3 are different types of chest pain and 4 is asymptomatic. Perhaps some classifiers can cope, but a split like cp > 2.5 is not in itself meaningful: it divides chest pain into two sets, yet there is no a priori reason for the values to be in that particular order, so partitioning by ordinal comparisons is at best an artificial limitation. Things are worse for linear models like this one or logistic regression. Such models fit coefficients that multiply the values of the various features, so a basic assumption is that the distance between 1 and 2 is the same as the distance between 3 and 4. With categorical variables even the ordering is meaningless, let alone "distance".

A standard technique is to replace categorical features' columns with dummy variables: binary-valued columns taking value 1 if a feature matches a particular value and 0 otherwise. pandas does this automatically: pd.get_dummies(df.cp, prefix='cp') produces a DataFrame with four 0-1 columns, 'cp_1.0', 'cp_2.0', .... We can regress on binary variables as usual. Ordinal comparisons make sense (the particular feature value occurs or not), and regression coefficients are just weightings on the presence or absence of each particular case.

Take some care to ensure the resulting data matrix does not include redundant information. cp took values between 1 and 4, so if for any row cp_1.0 == cp_2.0 == cp_3.0 == 0, we already know cp_4.0 == 1. We can drop one of the dummy columns for each categorical feature, because we will be adding a column of 1s to represent the intercept of our regression: a full set of dummy columns together with the intercept column is not linearly independent.
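A sketch of the dummy-variable step, continuing from above. In the coefficient table below it is the last level of each feature that gets dropped; the sketch here uses pandas' drop_first=True, which drops the first level instead, but either choice removes the redundancy:

```python
# Hypothetical list of the multi-valued categorical columns in this data set.
categorical_cols = ["cp", "restecg", "slope", "thal"]

# Expand each categorical column into 0/1 dummy columns, dropping one level
# per feature so that the dummies plus the intercept remain linearly independent.
df = pd.get_dummies(df, columns=categorical_cols, drop_first=True)
```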

Removing the redundant columns is an important step for guaranteeing a unique solution. It is especially important when performing OLS because we need the observation matrix to be full rank (the columns must be linearly independent) so that \(X^{t} X\) is invertible. If you neglect this step, you're likely to get some bizarrely huge values for some of the regression coefficients, and they may not be consistent or even comparable across runs.

In this case we dropped columns that are definitionally redundant, but there will be other data where some of the features happen to be redundant for other reasons. Deciding which columns to include is called feature selection and will be the topic of another article.

The final steps in our data preparation are to include an "intercept" column of 1s, split the data into a matrix X of observations of our features and a vector of targets y, and then split these into a training set and a test set. Note that num takes several values in the Cleveland data, depending on whether there is any heart disease and, if so, what kind of disease occurs. Because we only want to classify based on the presence of disease, we collapse all values where disease is present.
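A sketch of these final steps, continuing from the preprocessed DataFrame (the 80/20 split matches the proportions reported below):

```python
import numpy as np

# Collapse the multi-valued target: 0 stays 0 (no disease), 1-4 all become 1.
y = (df["num"] > 0).astype(float).to_numpy()

# Feature matrix with a leading column of 1s for the intercept.
X = df.drop(columns="num").to_numpy(dtype=float)
X = np.hstack([np.ones((len(X), 1)), X])

# Shuffle and split into training and test sets.
rng = np.random.default_rng(0)
idx = rng.permutation(len(y))
split = int(0.8 * len(y))
X_train, y_train = X[idx[:split]], y[idx[:split]]
X_test, y_test = X[idx[split:]], y[idx[split:]]
```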

Linear regression

Ordinary least squares linear regression fits a linear model to our data so \(X \beta\) is a predictor of \(y\), where \(\beta\) is a vector of weights for our various features.

You're asking yourself, "Why linear regression for a binary classification task? Isn't this what logistic regression is for?" Correct. We're pretending we've never even heard of it, and just want to see how far one can get with basic, naive techniques.

The goal is to minimize some function measuring how badly our predictor performs. In OLS this is the residual sum of squares. The vector of errors is

\(\textbf{e} = y - X\beta\)

and so the sum of squared errors is

\(\text{RSS} = (y - X\beta)^{t} (y - X\beta)\)

(the dot product of the error vector with itself is just the sum of the squared errors).

We wish to find the vector \(\beta\) minimizing this value. Differentiating with respect to \(\beta\) and setting the derivative to zero:

\(\frac{d\text{RSS}}{d\beta}=-2X^{t} (y - X\beta) = 0\)
\(X^{t} X \beta = X^{t} y\)
\(\beta = (X^{t} X)^{-1} X^{t} y\)

Please confirm this calculation yourself (and check that it's a minimum). Notice there is no real "machine learning" going on. No gradient descent or other process that converges on the best estimate for the coefficients. Instead we just do some algebraic manipulations and invert a matrix (N.B. this is why ensuring a full-rank observation matrix is important).
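In code this amounts to solving the normal equations; a sketch with numpy, continuing from the training data above (np.linalg.solve is preferable to forming the inverse explicitly):

```python
# Solve X^t X beta = X^t y for beta; equivalent to (X^t X)^{-1} X^t y
# but more numerically stable than computing the inverse.
beta = np.linalg.solve(X_train.T @ X_train, X_train.T @ y_train)
```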

Since this is a classification task, we naively decide that whenever an element of X @ beta is larger than 0.5 the model predicts y == 1, and otherwise it predicts y == 0.
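Continuing the sketch, the thresholding rule and the accuracy check might look like this:

```python
def predict(X, beta):
    # Threshold the continuous output of the linear model at 0.5.
    return (X @ beta > 0.5).astype(float)

train_acc = np.mean(predict(X_train, beta) == y_train)
test_acc = np.mean(predict(X_test, beta) == y_test)
print(f"train accuracy: {train_acc:.2%}, test accuracy: {test_acc:.2%}")
```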

What did we do so far? A few algebraic manipulations of the data have produced a regression hyperplane \(X\beta\) that passes through the mean of the data. In particular we have inferred a fit such that the sum of the residuals is zero (and so their mean is zero). These errors are uncorrelated with \(X\) and so we at least have some properties that suggest we have a reasonable model.

If we want to go further, and say that our linear model is not just reasonable, but good, we typically have to make some assumptions about the true data-generating process. If you would like to read more about this check out the Gauss-Markov theorem or Gauss-Markov assumptions.

How did we do?

Let's see how well our classifier does on the training set and the test set. The best way to do this is with a confusion matrix, but we'll just check accuracy for now. We also present the values of the various coefficients and their z-scores. This sort of table is useful for two reasons. First, it reveals a great advantage of linear models: their interpretability.

The table reveals the coefficient of each type of observation in the linear model. Care must be taken when interpreting this, but at least such a "white-box" model lets us make sure none of the coefficients are totally outlandish. We may please ourselves by saying "yes, sex has a positive relationship, men have more heart disease," or "age is inversely related? Ah, its effect must be captured by some of the other features." Very good.

Second, we produce z-scores for each coefficient, which provide a good indication of their significance. A rule of thumb is that a z-score greater than 2 in absolute value suggests the feature has a significant effect; if it is smaller, the feature may be insignificant and our model can be simplified (see [2]). Understand that a different statistical test is really appropriate for classifiers; I only show the z-score because it is generally a good test for typical linear regression problems.
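For reference, the usual computation (following [2], and presumably what produced the table below) divides each coefficient by its estimated standard error:

```python
# Coefficient z-scores: beta_j / (sigma_hat * sqrt(v_j)), where v_j is the
# j-th diagonal entry of (X^t X)^{-1} and sigma_hat^2 estimates the noise variance.
residuals = y_train - X_train @ beta
n, p = X_train.shape                          # p counts the intercept column
sigma2_hat = residuals @ residuals / (n - p)  # unbiased residual variance estimate
v = np.diag(np.linalg.inv(X_train.T @ X_train))
z_scores = beta / np.sqrt(sigma2_hat * v)
```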

In the table below, the only small z-scores are among dummy variables. If we wish to check whether an entire group of related dummy variables can be dropped, we must use a different measure called the F-statistic, which I will discuss another time.

So how did we do? On the training set (80% of the data), this basic predictor correctly classifies 86% of cases, and on the test set 88%. That's pretty good* considering we aren't using a "real" classifier or any statistics library, only techniques that would be covered in an undergraduate statistics or econometrics sequence.

We will develop some other machine learning techniques "from scratch" in later articles. I hope this series will be interesting and educational. Our goal is to understand the statistics behind ML and to develop that understanding by breaking things down to their (often) simple foundations.

Training set classification accuracy:
204/237 identified: 86.08%
Test set classification accuracy:
53/60 identified: 88.33%
Coefficients table:
                 Beta    z_score
intercept    0.479247  11.824461
age         -0.010282  -4.798849
sex          0.158594  18.356171
trestbps     0.028882  16.561706
chol         0.005048   3.131795
fbs         -0.057972  -4.390524
thalach     -0.047720 -17.500456
exang        0.149688  14.887216
oldpeak      0.032590  11.798650
ca           0.151245  64.168356
cp_1.0      -0.198938  -6.946534
cp_2.0      -0.166761 -10.124868
cp_3.0      -0.226567 -20.394162
restecg_0.0 -0.055289  -8.227967
restecg_1.0  0.051002   0.523778
slope_1.0   -0.068283  -2.144384
slope_2.0    0.050226   1.896164
thal_3.0    -0.197078 -20.976155
thal_6.0    -0.132475  -4.208395
*Looking at some random ML papers from the 90s, I see some fancy decision trees celebrated for 70-80% accuracy. I suspect those researchers treated all the features the same way, without the steps taken here to account for categorical features.

References

[1] Cleveland Heart Disease Data Set. UCI Machine Learning Repository. V.A. Medical Center, Long Beach and Cleveland Clinic Foundation: Robert Detrano, M.D., Ph.D.
[2] Hastie, T., Tibshirani, R., Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. (2009).