
Moving beyond Linearity

We often make the linearity assumption because it makes modeling easier. However, linearity is not always a good approximation, and is sometimes a poor one. In this section we extend the linear model by transforming the features.

For a univariate feature ($p = 1$), we have:

  • Polynomial regression
  • Step function
  • Regression spline

For multivariate features ($p > 1$), we have:

  • Local regression
  • Generalized additive model

Polynomial Regression

Polynomial regression is the model $y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \cdots + \beta_k x_i^k + \epsilon_i$, where $\epsilon_i$ is the error term.

  • We can use the OLS approach to fit such a model (see the sketch after this list).
  • Coefficients in polynomial regression are not directly interpretable; instead, we interpret the fitted function $\hat f(x_0) = \hat\beta_0 + \hat\beta_1 x_0 + \hat\beta_2 x_0^2 + \cdots + \hat\beta_k x_0^k$.
  • In this class, we use $k \le 4$ in practice, and $k$ can be chosen via cross-validation.
  • We can also use polynomial regression for classification, i.e. the logit (log-odds) in the logistic regression approach can be written as $\log \frac{p_1(x)}{1 - p_1(x)} = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_k x^k$, where $p_1(x) = P(Y_i = 1 \mid X_i = x)$.
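
Below is a minimal sketch of fitting a polynomial regression by OLS with NumPy; the simulated data and the degree $k = 3$ are illustrative assumptions, not part of these notes.

```python
import numpy as np

# Simulated data (illustrative only)
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=200)
y = 1 + 2 * x - 0.5 * x**3 + rng.normal(scale=0.3, size=x.shape)

k = 3
X = np.vander(x, N=k + 1, increasing=True)          # columns: 1, x, x^2, ..., x^k
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)    # OLS estimates of beta_0, ..., beta_k

# Interpret the fitted function f_hat(x_0) rather than the individual coefficients
x0 = np.linspace(-2, 2, 100)
f_hat = np.vander(x0, N=k + 1, increasing=True) @ beta_hat
```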

Step Function

Since the overall shape of a polynomial regression fit is determined globally by the degree of $X$, it extrapolates poorly, i.e. it tends to have large test error for data far away from the training data. To address this, we can use a step function, which divides the range of $X$ into several bins.

We cut the range of $X$ at $K$ points $c_1, \ldots, c_K$. We then define dummy variables $C_0(X), \ldots, C_K(X)$ that sum to 1 for each observation, where:

  • $C_0(X) = 1\{X < c_1\}$
  • $C_k(X) = 1\{c_k \le X < c_{k+1}\}$ for $k = 1, \ldots, K-1$
  • $C_K(X) = 1\{X \ge c_K\}$

Then the step function model is $y_i = \beta_0 + \beta_1 C_1(x_i) + \beta_2 C_2(x_i) + \cdots + \beta_K C_K(x_i) + \epsilon_i$, where $\epsilon_i$ is the error term.

  • Note that we do not need $C_0(x_i)$ in the model when we also have the intercept term $\beta_0$.
  • Again, we can fit this model using the OLS approach (see the sketch after this list).
  • We can now interpret the coefficients: $\beta_j$ represents the average change in the response $Y$ for $c_j \le X < c_{j+1}$ relative to the baseline bin $X < c_1$.
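
A minimal sketch of this construction in NumPy follows; the cut points and simulated data are illustrative assumptions.

```python
import numpy as np

# Simulated data and cut points (illustrative only)
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=300)
y = np.sin(x) + rng.normal(scale=0.2, size=x.shape)

cuts = np.array([2.5, 5.0, 7.5])     # c_1 < ... < c_K
K = len(cuts)
bins = np.digitize(x, cuts)          # bin index 0, 1, ..., K for each observation

# Dummy matrix for C_1, ..., C_K (C_0 is absorbed by the intercept)
C = np.column_stack([(bins == j).astype(float) for j in range(1, K + 1)])
X = np.column_stack([np.ones_like(x), C])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# beta_hat[j] (j >= 1) is the average change in Y for bin j relative to X < c_1
```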

Step functions seem to solve the extrapolation problem of polynomial regression, but this relies on a good choice of cut points $c_1, \ldots, c_K$. With a poor choice of cut points, the step function may fail to capture the trend in the data.

Regression Spline

We can use an idea similar to the step function to address this: use the cut points to fit a different polynomial in each region. Here the cut points are called knots. Knots are typically placed at certain quantiles of the data or equally spaced over the range of $X$. More knots lead to a more flexible piecewise polynomial. For better performance and interpretation, we add constraints on the polynomials at the knots:

  • Continuity: equal function values
  • Smoothness: equal first and second derivatives
  • Higher-order smoothness: equal higher-order derivatives

If we constrain the piecewise polynomials in this way, we call them splines. A degree-$d$ spline consists of piecewise degree-$d$ polynomials with continuous derivatives up to degree $d - 1$ at each knot.

We have a basis representation for the regression spline model: $y_i = \beta_0 + \beta_1 b_1(x_i) + \beta_2 b_2(x_i) + \cdots + \epsilon_i$, where $\epsilon_i$ is the error term and $b_j(x_i)$ is the $j$th basis function. Polynomial regressions and step functions are special cases of regression splines.

Some common splines:

Linear spline: a piecewise linear function that is continuous at each knot. Define knots $\xi_1 < \xi_2 < \cdots < \xi_K$; the model is $y_i = \beta_0 + \beta_1 x_i + \beta_2 (x_i - \xi_1)_+ + \beta_3 (x_i - \xi_2)_+ + \cdots + \beta_{K+1} (x_i - \xi_K)_+ + \epsilon_i$, where $\epsilon_i$ is the error term.

  • Here $b_{j+1}(x_i) = (x_i - \xi_j)_+ = \begin{cases} x_i - \xi_j & x_i > \xi_j \\ 0 & x_i \le \xi_j \end{cases}$.
  • Here $\beta_1 b_1(x_i) = \beta_1 x_i$, where $\beta_1$ is the average increase in $Y$ associated with a one-unit increase in $X$ for $x < \xi_1$.

Cubic spline: a piecewise cubic function with continuous derivatives up to order 2 at each knot. Define knots $\xi_1 < \xi_2 < \cdots < \xi_K$; the model is $y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + \beta_4 (x_i - \xi_1)_+^3 + \cdots + \beta_{K+3} (x_i - \xi_K)_+^3 + \epsilon_i$, where $\epsilon_i$ is the error term.
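
Below is a minimal sketch of a cubic regression spline built from this truncated power basis and fit by OLS; the knot placement (quartiles) and simulated data are illustrative assumptions.

```python
import numpy as np

def cubic_spline_basis(x, knots):
    """Columns: 1, x, x^2, x^3, (x - xi_1)_+^3, ..., (x - xi_K)_+^3."""
    poly = np.vander(x, N=4, increasing=True)
    trunc = np.maximum(x[:, None] - knots[None, :], 0.0) ** 3
    return np.hstack([poly, trunc])

# Simulated data (illustrative only); knots at the quartiles of x
rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=300)
y = np.sin(x) + rng.normal(scale=0.2, size=x.shape)
knots = np.quantile(x, [0.25, 0.50, 0.75])

B = cubic_spline_basis(x, knots)
beta_hat, *_ = np.linalg.lstsq(B, y, rcond=None)

# Fitted curve on a grid: continuous, with continuous 1st and 2nd derivatives at each knot
x_grid = np.linspace(0, 10, 200)
f_hat = cubic_spline_basis(x_grid, knots) @ beta_hat
```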

Natural spline: a regression spline with additional boundary constraints: the function is required to be linear in the boundary regions (beyond the outermost knots).

Local Regression

Local regression predicts at a target point $x_0$ using only the nearby training observations.

The algorithm proceeds as follows (a sketch of these steps appears after the list):

  1. Gather the fraction $s = k/n$ of the training observations that are closest to $x_0$.
  2. Assign a weight $K_{i0} = K(x_i, x_0)$ to each point in this neighborhood so that the point furthest from $x_0$ has weight zero and the closest has the highest weight. All but these $k$ nearest neighbors get weight zero.
  3. Fit a weighted least squares regression of the $y_i$ on the $x_i$ using the aforementioned weights, by finding $\hat\beta_0$ and $\hat\beta_1$ that minimize $\sum_{i=1}^n K_{i0} (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2$.
  4. The fitted value at $x_0$ is $\hat f(x_0) = \hat\beta_0 + \hat\beta_1 x_0$.
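
A minimal sketch of these four steps for a single target point follows; the tricube weight function and the simulated data are illustrative assumptions (the notes do not fix a particular kernel).

```python
import numpy as np

def local_regression_predict(x, y, x0, s=0.3):
    """Local linear regression at a single target point x0 with span s = k/n."""
    n = len(x)
    k = max(2, int(np.ceil(s * n)))
    dist = np.abs(x - x0)
    idx = np.argsort(dist)[:k]              # step 1: k nearest neighbors of x0
    w = np.zeros(n)
    u = dist[idx] / dist[idx].max()
    w[idx] = (1 - u**3) ** 3                # step 2: tricube weights (assumed kernel)

    # Step 3: weighted least squares, minimizing sum_i K_i0 (y_i - b0 - b1 x_i)^2
    sw = np.sqrt(w)
    A = np.column_stack([np.ones_like(x), x]) * sw[:, None]
    beta_hat, *_ = np.linalg.lstsq(A, y * sw, rcond=None)
    return beta_hat[0] + beta_hat[1] * x0   # step 4: f_hat(x0)

# Simulated data (illustrative only)
rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=200)
y = np.sin(x) + rng.normal(scale=0.2, size=x.shape)
print(local_regression_predict(x, y, x0=5.0, s=0.3))
```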

Some notes on the local regression algorithm:

  • The span (bandwidth) $s$ is a tuning parameter that controls the number of neighbors used in the local regression; we can use cross-validation to choose it.
  • The weight function for the points in the neighborhood needs to be specified.
  • KNN (k-nearest neighbors) is one of the most common local regression approaches. It corresponds to a constant weight $K_{i0} = 1/k$ within the neighborhood and $\beta_1 = 0$ (fitting only a local constant).

KNN

The KNN classification algorithm is as follows:

  1. Find the $k$ training observations $(x_{(1)}, y_{(1)}), \ldots, (x_{(k)}, y_{(k)})$ that are closest to the test instance $x_0$.
  2. The classification output is the majority class $\arg\max_{y \in \mathcal{Y}} \sum_{i=1}^k I(y_{(i)} = y)$ (see the sketch after this list).
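
A minimal sketch of this classifier for a single test point, using Euclidean distance and a majority vote; the simulated data are an illustrative assumption.

```python
import numpy as np

def knn_classify(X_train, y_train, x0, k=5):
    # Step 1: find the k training observations closest to x0 (Euclidean distance)
    dist = np.linalg.norm(X_train - x0, axis=1)
    nearest = np.argsort(dist)[:k]
    # Step 2: majority vote among the k neighbors
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Simulated data (illustrative only)
rng = np.random.default_rng(4)
X_train = rng.normal(size=(100, 2))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)
print(knn_classify(X_train, y_train, x0=np.array([0.5, 0.5]), k=5))
```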

We can use a Voronoi diagram to visualize the classifier's behavior; the boundary between regions of the feature space assigned to different classes is the decision boundary.

KNN is sensitive to noise and mislabeled data. It is also computationally expensive, since it requires computing the distance between each test instance and every training instance. It does not work well in high dimensions, where points tend to be approximately equidistant from one another. That said, the neighborhood structure depends on the intrinsic dimensionality of the data, and low intrinsic dimension is the saving grace.

We also face a tradeoff in choosing $k$:

  • A small $k$ captures fine-grained patterns and gives a more flexible decision boundary, but has high variance and may be sensitive to random idiosyncrasies in the training data.
  • A large $k$ gives a less flexible decision boundary with lower variance, but may fail to capture important regularities.
  • Balancing $k$: the optimal choice of $k$ depends on the number of data points $n$, with nice theoretical properties when $k \to \infty$ and $k/n \to 0$ as $n \to \infty$. In practice, $k$ can be chosen from $[1, \sqrt{n}]$ via cross-validation (see the sketch after this list).
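
A minimal sketch of choosing $k$ over $[1, \sqrt{n}]$ by cross-validation; scikit-learn is an assumed dependency here (any CV routine would do), and the simulated data are illustrative.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Simulated data (illustrative only)
rng = np.random.default_rng(5)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

n = len(y)
candidate_ks = list(range(1, int(np.sqrt(n)) + 1))
cv_scores = [
    cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    for k in candidate_ks
]
best_k = candidate_ks[int(np.argmax(cv_scores))]
print("best k:", best_k)
```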

Generalized Additive Models (GAMs)

GAMs provide a general framework for extending a standard linear model by allowing non-linear functions of each of the variables while maintaining additivity: $y_i = \beta_0 + f_1(x_{i1}) + f_2(x_{i2}) + \cdots + f_p(x_{ip}) + \epsilon_i$, where $\epsilon_i$ is the error term.

  • Each $f_j$ can be linear, a polynomial, a step function, a spline, or a local regression (see the sketch after this list).
  • Can be used for both regression and classification problems.
    • i.e. logistic regression: $\operatorname{logit} P(Y_i = 1 \mid X_i = x_i) = \beta_0 + f_1(x_{i1}) + f_2(x_{i2}) + \cdots + f_p(x_{ip})$
  • Allowing a non-linear $f_j$ for each $X_j$ lets us capture non-linear relationships between $X_j$ and $Y$.
  • Due to additivity, we can interpret the effect of each $X_j$ on $Y$ while holding all other variables fixed.
  • But GAMs fail to capture interactions between $X_j$ and $X_k$.
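
Below is a minimal sketch of a GAM-style fit: each feature gets its own cubic spline basis (the truncated power basis from above) and the additive model is fit by OLS. The simulated data, knot placement, and basis choice are illustrative assumptions.

```python
import numpy as np

def cubic_spline_basis(x, knots):
    """Columns: x, x^2, x^3, (x - xi_1)_+^3, ... (intercept handled separately)."""
    poly = np.vander(x, N=4, increasing=True)[:, 1:]
    trunc = np.maximum(x[:, None] - knots[None, :], 0.0) ** 3
    return np.hstack([poly, trunc])

# Simulated data with an additive structure (illustrative only)
rng = np.random.default_rng(6)
n = 400
x1, x2 = rng.uniform(0, 10, size=n), rng.uniform(-3, 3, size=n)
y = np.sin(x1) + 0.5 * x2**2 + rng.normal(scale=0.3, size=n)

# One block of basis columns per feature: y ~ beta_0 + f_1(x1) + f_2(x2)
B1 = cubic_spline_basis(x1, np.quantile(x1, [0.25, 0.50, 0.75]))
B2 = cubic_spline_basis(x2, np.quantile(x2, [0.25, 0.50, 0.75]))
X = np.hstack([np.ones((n, 1)), B1, B2])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Thanks to additivity, f_1 and f_2 can be inspected separately;
# interactions between x1 and x2 are not captured by this model.
f1_hat = B1 @ beta_hat[1:1 + B1.shape[1]]
f2_hat = B2 @ beta_hat[1 + B1.shape[1]:]
```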