
Kernel Method

Define a feature map $\psi(x): \R^D \to \R^M$, input data $X \in \R^{N \times D}$, and $\Psi \in \R^{N \times M}$, where $\Psi = \psi(X)$ (i.e. the $i$-th row of $\Psi$ is $\psi(x^{(i)})^T$). Then the vector of predictions is $\hat{y} = \Psi w$, and $y|x \sim N(w^T\psi(x), \sigma^2)$.

If $N < M$, the MLE is not unique. Flexibility may require a large number $M$ of features, and $M$ may even need to grow with $N$. The kernel method addresses this problem and is more flexible.

Kernels in Ridge Regression

Recall the ridge regression problem: we minimize $E(w) = \frac{1}{2}\|y - \Psi w\|^2 + \frac{\lambda}{2}w^T w$. Setting $\nabla E(w) = \Psi^T \Psi w - \Psi^T y + \lambda w = 0$ gives the solution $w = \frac{1}{\lambda} \Psi^T(y - \Psi w) \in \R^M$.

Now consider kernel ridge regression. Define the Gram matrix (kernel matrix) $K = \Psi\Psi^T \in \R^{N \times N}$, with entries $K_{ij} = \psi(x^{(i)})^{\top} \psi(x^{(j)}) =: k(x^{(i)}, x^{(j)})$, which is PSD.

Define $a = (y - \Psi w)/\lambda \in \R^N$, so that $w = \Psi^T a$. Substituting gives $E(a) = \frac{1}{2} \|y - \Psi\Psi^T a\|^2 + \frac{\lambda}{2} a^T \Psi\Psi^T a = \frac{1}{2} \|y - Ka\|^2 + \frac{\lambda}{2} a^T K a$.

Minimizing $E(a)$ gives $a = (K + \lambda I_N)^{-1} y$; plugging this back into $w = \Psi^T a$ gives $w = \Psi^T (K + \lambda I_N)^{-1} y$.

The prediction at a new input $x$ is $\hat{y}(x) = \psi(x)^{T} w = \psi(x)^T \Psi^T a = k(x)^T (K + \lambda I_N)^{-1} y$, where $k(x) \in \R^N$ has entries $k_i(x) = k(x^{(i)}, x)$.
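
Below is a minimal numpy sketch of this fit-and-predict procedure. The Gaussian (RBF) kernel, the length scale, and the toy data are assumptions made purely for illustration, not part of the derivation above.

```python
import numpy as np

def rbf_kernel(X1, X2, length_scale=0.2):
    # Assumed kernel: k(x, x') = exp(-||x - x'||^2 / (2 * length_scale^2))
    sq = np.sum(X1**2, axis=1)[:, None] + np.sum(X2**2, axis=1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-sq / (2 * length_scale**2))

def kernel_ridge_fit(X, y, lam=0.1):
    # a = (K + lambda I_N)^{-1} y, with K_ij = k(x^(i), x^(j))
    K = rbf_kernel(X, X)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def kernel_ridge_predict(X_train, a, X_new):
    # y_hat(x) = k(x)^T a, with k_i(x) = k(x^(i), x)
    return rbf_kernel(X_new, X_train) @ a

# Toy usage on synthetic 1-D data
X = np.linspace(0, 1, 20)[:, None]
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * np.random.randn(20)
a = kernel_ridge_fit(X, y, lam=0.1)
y_hat = kernel_ridge_predict(X, a, X)
```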

Some Kernel Functions

  • Radial basis function: depends only on the distance from a centre $\mu_j$, i.e. $\psi_j(x) = h(\|x - \mu_j\|)$.
  • Sigmoidal basis function: $h$ is a sigmoid.
  • Gaussian basis function: $h$ is a Gaussian.
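
For illustration, here is a small sketch of an explicit Gaussian basis feature map $\psi$ and the Gram matrix $K = \Psi\Psi^T$ built from it; the centres $\mu_j$ and the width are hypothetical choices, not values from the notes.

```python
import numpy as np

def gaussian_basis(X, centers, width=0.2):
    # psi_j(x) = h(||x - mu_j||) with h Gaussian: exp(-||x - mu_j||^2 / (2 * width^2))
    sq_dists = np.sum((X[:, None, :] - centers[None, :, :])**2, axis=-1)  # shape (N, M)
    return np.exp(-sq_dists / (2 * width**2))

X = np.random.rand(50, 1)                     # N = 50 inputs in R^1
centers = np.linspace(0, 1, 10)[:, None]      # M = 10 centres mu_j
Psi = gaussian_basis(X, centers)              # Psi in R^{N x M}
K = Psi @ Psi.T                               # Gram matrix Psi Psi^T in R^{N x N}
```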

Gaussian Process

def: a Gaussian process is a probability distribution over functions $\hat{y}(x)$ such that for any $N \ge 1$ and any set of $N$ points $x^{(1)}, \dots, x^{(N)} \in \R^D$, the vector $(\hat{y}(x^{(1)}), \dots, \hat{y}(x^{(N)}))$ has a multivariate Gaussian distribution.

Suppose $y|x \sim N(\hat{y}(x), \sigma^2)$ with $\hat{y}(x) = w^T\psi(x)$ and a Gaussian prior $w \sim N(0, \alpha^{-1}I_M)$. Then $\hat{y} = \Psi w \sim N(0, \alpha^{-1} \Psi \Psi^T)$.

So $\hat{y}(x)$ is a Gaussian process with mean $0$ and covariance function $\alpha^{-1} k(x, x')$.
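
To make "a distribution over functions" concrete, one can draw sample vectors $(\hat{y}(x^{(1)}), \dots, \hat{y}(x^{(N)}))$ from the prior $N(0, \alpha^{-1}K)$ on a grid of inputs. The RBF kernel and the hyperparameter values below are assumptions for illustration only.

```python
import numpy as np

def rbf_kernel(X1, X2, length_scale=0.3):
    sq = np.sum(X1**2, axis=1)[:, None] + np.sum(X2**2, axis=1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-sq / (2 * length_scale**2))

alpha = 1.0
X_grid = np.linspace(0, 1, 100)[:, None]
K = rbf_kernel(X_grid, X_grid)
# Each sample row is one function evaluated on the grid; small jitter keeps the
# covariance numerically positive definite.
samples = np.random.multivariate_normal(
    np.zeros(100), K / alpha + 1e-8 * np.eye(100), size=5)
```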

Given $y|x \sim N(\hat{y}(x), \sigma^2)$ with $\hat{y}(x) = w^T \psi(x)$ and $N$ independent observations, we have $y|\hat{y} \sim N(\hat{y}, \sigma^2 I_N)$ and $\hat{y} \sim N(0, \alpha^{-1} \Psi \Psi^T)$. The marginal of $y$ is then $y \sim N(0, C_N)$, where $C_N = \alpha^{-1}K + \sigma^2 I_N$ has entries $c(x^{(i)}, x^{(j)}) = \alpha^{-1} k(x^{(i)}, x^{(j)}) + \sigma^2 \delta(x^{(i)}, x^{(j)})$.

  • $\delta(x, x') = 1$ if $x = x'$, and $0$ otherwise.

MLE: the kernel hyperparameters $\theta$ can be fit by maximizing the log marginal likelihood $\log p(y|\theta) = - \frac{1}{2} \log |C_N| - \frac{1}{2} y^T C_N^{-1} y - \frac{N}{2} \log 2\pi$.
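
A sketch of evaluating this log marginal likelihood in numpy, assuming $C_N = \alpha^{-1}K + \sigma^2 I_N$ as above; using a Cholesky factorisation for the log-determinant and the quadratic term is a standard numerical choice, not something prescribed by these notes.

```python
import numpy as np

def log_marginal_likelihood(K, y, alpha=1.0, sigma2=0.1):
    # log p(y | theta) = -1/2 log|C_N| - 1/2 y^T C_N^{-1} y - N/2 log(2*pi)
    N = len(y)
    C = K / alpha + sigma2 * np.eye(N)
    L = np.linalg.cholesky(C)                 # C = L L^T
    v = np.linalg.solve(L, y)                 # then y^T C^{-1} y = ||v||^2
    log_det = 2 * np.sum(np.log(np.diag(L)))  # log|C| = 2 * sum_i log L_ii
    return -0.5 * log_det - 0.5 * v @ v - 0.5 * N * np.log(2 * np.pi)
```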

Example: regression

Denote $y_N = (y^{(1)}, \ldots, y^{(N)})$; its marginal is $y_N \sim N(0, C_N)$ with $C_N = \alpha^{-1}K_N + \sigma^2 I_N$.

Then we can predict the new output $y^{(N+1)}$: the joint vector $y_{N+1} = (y^{(1)}, \ldots, y^{(N+1)})$ satisfies $y_{N+1} \sim N(0, C_{N+1})$ with $C_{N+1} = \alpha^{-1}K_{N+1} + \sigma^2 I_{N+1}$.

Equivalently, in block form, $C_{N+1} = \begin{bmatrix} C_N & k \\ k^T & c \end{bmatrix}$, where:

  • $c = \frac{1}{\alpha}k(x^{(N+1)}, x^{(N+1)}) + \sigma^2$
  • $k_i = \frac{1}{\alpha}k(x^{(i)}, x^{(N+1)})$

Therefore, the predictive distribution for the new output is $y^{(N+1)} \mid y_N \sim N(k^T C_N^{-1} y_N,\; c - k^T C_N^{-1} k)$.
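
A minimal numpy sketch of this predictive step, again assuming an RBF kernel and fixed hyperparameters $\alpha, \sigma^2$; all names here are illustrative.

```python
import numpy as np

def rbf_kernel(X1, X2, length_scale=0.3):
    sq = np.sum(X1**2, axis=1)[:, None] + np.sum(X2**2, axis=1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-sq / (2 * length_scale**2))

def gp_predict(X_train, y_train, x_new, alpha=1.0, sigma2=0.1):
    # C_N = (1/alpha) K_N + sigma^2 I_N
    C_N = rbf_kernel(X_train, X_train) / alpha + sigma2 * np.eye(len(X_train))
    # k_i = (1/alpha) k(x^(i), x^(N+1)),  c = (1/alpha) k(x^(N+1), x^(N+1)) + sigma^2
    k = rbf_kernel(X_train, x_new[None, :]).ravel() / alpha
    c = rbf_kernel(x_new[None, :], x_new[None, :])[0, 0] / alpha + sigma2
    C_inv_k = np.linalg.solve(C_N, k)
    mean = k @ np.linalg.solve(C_N, y_train)   # k^T C_N^{-1} y_N
    var = c - k @ C_inv_k                      # c - k^T C_N^{-1} k
    return mean, var
```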

Example: classification

Place a GP prior over a function $a(x)$ and transform it with the sigmoid: $\hat{y}(x) = \sigma(a(x))$, so that $p(y|a) = \sigma(a)^{y} (1 - \sigma(a))^{1-y}$ for $y \in \{0, 1\}$.

Then $a_{N+1} \sim N(0, C_{N+1})$ with $C_{N+1}(x^{(i)}, x^{(j)}) = \frac{1}{\alpha} k(x^{(i)}, x^{(j)}) + \nu\delta_{ij}$, where $\nu > 0$ ensures the covariance matrix is positive definite.

Since $a$ is a latent function that is not directly observed, the predictive distribution is $p(y^{(N+1)}|y_N) = \int p(y^{(N+1)}|a_{N+1})\, p(a_{N+1}|y_N)\, da_{N+1}$.

This integral is intractable, so we need MCMC-based methods or numerical integration to approximate it.
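
As one illustration, if $p(a_{N+1}|y_N)$ has already been approximated by a Gaussian $N(\mu, s^2)$ (e.g. by a Laplace approximation, which is not derived in these notes), the remaining one-dimensional integral can be estimated by plain Monte Carlo:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def predictive_prob(mu, s2, n_samples=10000, rng=None):
    # p(y^(N+1) = 1 | y_N) ≈ (1/S) * sum_s sigmoid(a_s), with a_s ~ N(mu, s2);
    # mu and s2 are the assumed Gaussian approximation of p(a_{N+1} | y_N).
    rng = np.random.default_rng() if rng is None else rng
    a_samples = rng.normal(mu, np.sqrt(s2), size=n_samples)
    return sigmoid(a_samples).mean()
```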