Given data $$(X, Y)$$, where $$X_{ij}$$ is the $$i$$th observation of the $$j$$th feature, we may consider a regression model $$g(E(Y)) = \alpha + \sum_{j=1}^{p}f_j(X_{ij})$$.

$$g$$ is the link function, which controls how the model relates to the response variable. For example, in generalized linear models (GLMs), where $$f_j(X_{ij}) = \beta_j X_{ij}$$, typical choices of the link function recover standard models: $$g(z) = z$$ gives ordinary linear regression, and $$g(z) = \ln(\frac{z}{1-z})$$, the logit link function, gives logistic regression.
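As a minimal sketch, these two link functions can be written out directly (the names `identity_link`, `logit_link`, and `inv_logit` are illustrative, not from any particular library):

```python
import math

# Identity link: g(z) = z, recovers ordinary linear regression.
def identity_link(z):
    return z

# Logit link: g(z) = ln(z / (1 - z)), recovers logistic regression.
def logit_link(z):
    return math.log(z / (1 - z))

# The inverse logit (sigmoid) maps a linear predictor eta back to a
# probability in (0, 1), undoing the logit link.
def inv_logit(eta):
    return 1.0 / (1.0 + math.exp(-eta))
```

The link maps the mean of the response onto the scale of the additive predictor; its inverse maps predictions back to the response scale.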

Instead of restricting each $$f_j$$ to be linear, a generalized additive model (GAM) fits a smooth function to each predictor.

We will consider GAMs with the identity link and each $$f_j$$ a scatterplot smoother.

Note that this model assumes the covariates have additive effects. To see why this might be a problem, consider a two variable case where we predict a patient's blood pressure $$Y$$ from age $$X_1$$ and weight $$X_2$$.

For two patients aged 40 and 70 we have,

$$E(Y\vert X_1 = 40, X_2) = \alpha + f_1(40) + f_2(X_2)$$

$$E(Y\vert X_1 = 70, X_2) = \alpha + f_1(70) + f_2(X_2)$$

and so blood pressure's dependence on weight varies by a constant for different ages: a hard assumption to swallow. It's easy to imagine weight having a more severe effect on older patients. In any case, these simplifying assumptions are often necessary, but one should be aware that they have been made.

Backfitting algorithm

To fit an additive model we first require that the smoothers have mean $$0$$, so $$\alpha$$ is just the mean of the response variable. We then pick some functional form for our smoothers. A typical choice is the cubic spline smoother, but kernel smoothers or other local regression methods are possible.
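As one concrete example of a scatterplot smoother, here is a minimal Nadaraya-Watson kernel smoother with a Gaussian kernel (the function name and `bandwidth` parameter are illustrative; a cubic spline smoother would be a more typical production choice):

```python
import math

def kernel_smooth(x, y, bandwidth=1.0):
    # Nadaraya-Watson estimator: the fitted value at each x[i] is a
    # weighted average of all y values, with weights from a Gaussian
    # kernel centered at x[i].
    fitted = []
    for xi in x:
        weights = [math.exp(-0.5 * ((xi - xj) / bandwidth) ** 2) for xj in x]
        total = sum(weights)
        fitted.append(sum(w * yj for w, yj in zip(weights, y)) / total)
    return fitted
```

The bandwidth controls the bias-variance tradeoff: a small bandwidth tracks the data closely, while a large one approaches a global mean.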

The backfitting algorithm is:

1. Fix $$\hat{\alpha} = \text{mean}(y)$$ and set each $$f_j = 0$$
2. Until each $$f_j$$ converges \begin{align}\text{for}(j=1\dots p)&:\\ &f_j \leftarrow \text{Smooth}\left(Y - \hat{\alpha} - \sum_{k \neq j}f_k(X_{k})\right)\\&f_j \leftarrow f_j - \text{mean}(f_j)\end{align}

Motivation

The backfitting algorithm may be rigorously justified in several ways. I will not give the details, but perhaps the simplest is showing that it yields a solution to a penalized least squares minimization problem.

Intuitively, given the high dimensional regression problem $$Y=f(X) + \epsilon$$ the data $$X \in \mathbb{R}^p$$ may be so sparse that it is too difficult to fit $$f: \mathbb{R}^p \rightarrow \mathbb{R}$$.

So it is assumed that $$f$$ is just a sum of one-dimensional functions, $$f(X) = \alpha + \sum_{j=1}^{p}f_{j}(x_j)$$: still a big assumption, but the increased flexibility over a linear model is clear.

Additive models do not sacrifice too much interpretability for this benefit. In a linear model we understand the effect of each feature $$x_j$$ by its coefficient $$\beta_j$$. In additive models, the partial response function $$f_j$$ plays this role. Visualization must be a table of function plots rather than regression coefficients, but this may even be an advantage. For instance, if the nonlinear capabilities of an additive model offer a better fit, the set of partial response function plots can reveal precisely where nonlinearities occur, without doing any math to decompose the effect into linear and nonlinear components.

The backfitting algorithm then involves fitting each additive component function to the residuals left by the others. Some hefty computation must be done, as each step involves applying the scatterplot smoother to every component of $$X$$. Fortunately, one-dimensional smoothing can be applied quickly, and the backfitting algorithm tends to converge in relatively few cycles (usually fewer than 20, per ESL).

References

Hastie, T., Tibshirani, R., Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. (2009).
Härdle, W., Hall, P. "On the backfitting algorithm for additive regression models". Statistica Neerlandica. 1993.