## Intro

There are excellent tutorials on linear regression and a suite of other statistical and machine learning techniques for Python programmers. Here, I build a least squares linear regression classifier "from scratch", that is, without using any sophisticated libraries. This is not the right way to do things. `sklearn` is well-tested, fast, and provides all the tools shown here. Doing the mathematics yourself is a great way to learn, and so this post is the first in a series on various machine learning topics attempting to reveal the simple and accessible the underlying mathematics is.

All the code for this project may be found here.

## Data

We consider the Cleveland Heart Disease data set. It offers a number
of observations from ~300 patients including a target variable, `num`

, which
denotes the incidence of heart disease. In the data set it is a categorical
variable with several values, but we will only attempt to identify the presence
or absence of disease indicated by >50% narrowing of a blood vessel.

The data include several real valued observations as well as a few categorical features. We will get to handling categorical data in a regression task below.

## Preprocessing

The Cleveland data set has only a few missing values. These are identified and removed. For some tasks it may be appropriate to replace missing values with, e.g., the mean.

The next step is normalizing the real or integer valued columns, (i.e., those columns with measurements on a scale where distances are meaningful). I have seen it recommended to scale using the maximum and minimum values, but this ignores the specifics of the distribution. Indeed the particular max/min might be wild outliers. Instead, take the z-score of each column.

Best practice dictates checking these columns for (rough) normality. It is often an assumption in some statistical technique that the data be normal (0 mean, unit variance, check them!) and so you should at least look at a histogram to see whether some transformation is appropriate. I do not do this. It's fine. There's nothing outrageous or obviously lognormal so we proceed.

Onto categorical features. A nice tutorial uses a set of scikit-learn classifiers
on the same data set. The preprocessing step does not have any special handling
of categorical variables to real-valued measurements. For instance,
`cp`

denotes the chest pain type, it takes values 1,2,3,4 where 1,2,3 are
different types of chest pain and 4 is asymptomatic. Perhaps some classifiers
can infer that `cp > 2.5`

is not in itself meaningful: It divides chest pain
into two sets, but there's no a priori reason for its values to be
in that particular order, so partitioning by ordinal comparisons is at best
an artificial limitation. Things are worse for linear models
like this one or logistic regression. Such processes infer coefficients
multiplying the values of various features, and a basic assumption is that the distance
between 1 and 2 is the same as the distance between 3 and 4. With categorical
variables even the ordering is meaningless, let alone "distance".

A standard technique is to replace categorical features' columns with dummy
variables: Binary valued columns taking value 1 if a feature matches a
particular value, and 0 otherwise. pandas does this automatically, with
`pd.get_dummies(df.cp,prefix='cp')`

producing a DataFrame with four 0-1
columns, `'cp_1.0', 'cp_2.0'...`

. We can regress on binary variables as
usual. Ordinal comparisons make sense, the particular feature occurs or not,
and regression coefficients are just weightings on the presence or absence of
each particular case.

Take some care to ensure the resulting data matrix does not include redundant
information. `cp`

took values between 1 and 4, so if for any row
`cp_1.0 == cp_2.0 == cp_3.0 == 0`

, we already know `cp_4.0 == 1`

. We can drop
any one of the dummy columns from the data matrix because we will be adding
a column of 1s to represent the intercept of our regression. A full set of dummy columns
and the intercept column are not linearly independent.

Removing the redundant columns is an important step for guaranteeing a unique solution, that is, our parameters should be roughly consistent or at least comparable across runs. It is especially important when performing OLS because we want our matrix to be full rank (the columns must be linearly independent). If you neglect this step, you're likely to get some bizarrely huge values for some of the regression coefficients.

In this case we dropped columns that are definitionally redundant, but there will be other data where some of the features happen to be redundant for other reasons. Deciding which columns to include is called feature selection and will be the topic of another article.

The final steps in our data preparation are to include an "intercept" column
of 1s, split the data into a matrix of observations `X`

of our features and a
vector of targets `y`

, and then split these into a training set and a test set.
Note the Cleveland data has several values `num`

can take, depending on
whether there is any heart disease and if so, what kind of disease occurs.
Because we
only want to classify based on the presence of disease, we collapse all values
where disease is present.

## Linear regression

Ordinary least squares linear regression fits a linear model to our data so \(X \beta\) is a predictor of \(y\), where \(\beta\) is a vector of weights for our various features.

You're asking yourself, "Why linear regression for a binary classification task? Isn't this what logistic regression is for?" Correct. We're pretending we've never even heard of it, and just want to see how far one can get with basic, naive techniques.

The goal is to minimize some function measuring how badly our predictor performs. In OLS this is the residual sum of squares. The vector of errors is

\(\textbf{e} = y - X\beta\)

and so the sum of squared errors is\(\text{RSS} = (y - X\beta)^{t} (y - X\beta)\)

--taking the dot product of the error vector with itself gives the sum of squares of errors.We wish to find the vector `B`

minimizing this value. Differentiating with
respect to `B`

and setting the derivative to zero:

\(\frac{d\text{RSS}}{d\beta}=-2X^{t} (y - X\beta) = 0\)

\(X^{t} X \beta = X^{t} y\)

\(\beta = (X^{t} X)^{-1} X^{t} y\)

Please confirm this calculation yourself (and check that it's a minimum). Notice there is no real "machine learning" going on. No gradient descent or other process that converges on the best estimate for the coefficients. Instead we just do some algebraic manipulations and invert a matrix (N.B. this is why ensuring a full-rank observation matrix is important).

Since this is a classification task, we will naively decide any time an
element of `X @ Beta`

is larger than 0.5 it is predicting `y == 1`

and
if smaller, predicts `y == 0`

.

What did we do so far? A few algebraic manipulations of the data have produced a regression hyperplane \(X\beta\) that passes through the mean of the data. In particular we have inferred a fit such that the sum of the residuals is zero (and so their mean is zero). These errors are uncorrelated with \(X\) and so we at least have some properties that suggest we have a reasonable model.

If we want to go further, and say that our linear model is not just reasonable, but good, we typically have to make some assumptions about the true data-generating process. If you would like to read more about this check out the Gauss-Markov theorem or Gauss-Markov assumptions.

## How did we do?

Let's see how well our classifier does on the training set and the test set. The best way to do this is with a confusion matrix, but we'll just check accuracy for now. We also present the values of various coefficients and also their z-scores. This sort of table is useful for two reasons. First, it reveals a great advantage of linear models which is their interpretability.

The table reveals the coefficients of each type of observation in the linear model. Care must be taken when interpreting this, but at least such a "white-box" model lets us make sure none of of the coefficients are totally outlandish. We may please ourselves by saying "yes, sex has a positive relationship, men have more heart disease," or "age is inversely related? Ah, it's effect must be captured by some of the other features." Very good.

Second, we produce z-scores for each coefficient, which provide a good indication for their significance. A rule of thumb is that a z-score > 2 in absolute value suggests the feature has a significant effect, if smaller it may be insignificant and our model can be simplified (see [2]). Now understand a different statistical test is appropriate for classifiers. I only show the z-score because it's generally a good test for typical linear regression problems.

In the table below, the only small z scores are among dummy variables. If we wish to check whether dropping an entire set of observations (i.e., a group of related dummy variables) we must use a different measure called the F-statistic, which I will discuss another time.

So how did we do? On the training set (80% of the data), this basic predictor correctly classifies 86% of cases, and on the test set 88%. That's pretty good* considering we aren't even using a "real" classifier, no statistics library, and techniques that would be covered in an undergraduate statistics or econometrics sequence.

We will develop some other machine learning techniques "from scratch" in later articles. I hope this series will be interesting and educational. Our goal is to understand the statistics behind ML and to develop that understanding by breaking things down to their (often) simple foundations.

Training set classification accuracy: 204/237 identified: 86.08% Test set classification accuracy: 53/60 identified: 88.33%Coefficients table:

Beta z_score intercept 0.479247 11.824461 age -0.010282 -4.798849 sex 0.158594 18.356171 trestbps 0.028882 16.561706 chol 0.005048 3.131795 fbs -0.057972 -4.390524 thalach -0.047720 -17.500456 exang 0.149688 14.887216 oldpeak 0.032590 11.798650 ca 0.151245 64.168356 cp_1.0 -0.198938 -6.946534 cp_2.0 -0.166761 -10.124868 cp_3.0 -0.226567 -20.394162 restecg_0.0 -0.055289 -8.227967 restecg_1.0 0.051002 0.523778 slope_1.0 -0.068283 -2.144384 slope_2.0 0.050226 1.896164 thal_3.0 -0.197078 -20.976155 thal_6.0 -0.132475 -4.208395*Looking at some random ML papers from the 90s I see some fancy decision trees celebrated for 70-80% accuracy. I suspect the researchers treat all the features in the same way, without the steps I took to account for categorical features.

## References

Cleveland Heart Disease Data Set. (UCI Machine Learning Repository) V.A. Medical Center, Long Beach and Cleveland Clinic Foundation:Robert Detrano, M.D., Ph.D.Hastie, T., Tibshirani, R., Friedman, J.

*The Elements of Statistical Learning: Data Mining, Inference, and Prediction*. 2nd ed. (2009).