Define a feature map $\psi(x): \mathbb{R}^D \to \mathbb{R}^M$, input data $X \in \mathbb{R}^{N \times D}$, and the design matrix $\Psi \in \mathbb{R}^{N \times M}$, where $\Psi = \psi(X)$ row-wise (i.e. $\Psi_{i\cdot} = \psi(x^{(i)})^T$). Then the prediction vector is $\hat{y} = \Psi w$, and $y \mid x \sim \mathcal{N}(w^T \psi(x), \sigma^2)$.
If $N < M$, the MLE is not unique. Flexibility may require a large number $M$ of features, and $M$ may need to grow with $N$. This motivates the kernel method, which handles this problem more flexibly.
Kernels in Ridge Regression
Recall the ridge regression problem: we minimize $E(w) = \frac{1}{2}\|y - \Psi w\|^2 + \frac{\lambda}{2} w^T w$ (i.e. $\nabla E(w) = \Psi^T \Psi w - \Psi^T y + \lambda w = 0$), which has the solution $w = \frac{1}{\lambda}\Psi^T(y - \Psi w) \in \mathbb{R}^M$.
Now let's look at kernel ridge regression. Def: the Gram matrix (kernel matrix) is $K = \Psi \Psi^T \in \mathbb{R}^{N \times N}$, where $K_{ij} = \psi(x^{(i)})^T \psi(x^{(j)}) =: k(x^{(i)}, x^{(j)})$; $K$ is PSD.
Define $a = (y - \Psi w)/\lambda \in \mathbb{R}^N$. Then $w = \Psi^T a$ and $E(a) = \frac{1}{2}\|y - \Psi\Psi^T a\|^2 + \frac{\lambda}{2} a^T \Psi\Psi^T a = \frac{1}{2}\|y - K a\|^2 + \frac{\lambda}{2} a^T K a$.
Setting $\nabla E(a) = 0$ gives $a = (K + \lambda I_N)^{-1} y$; plugging this back into $w = \Psi^T a$, we have $w = \Psi^T (K + \lambda I)^{-1} y$.
$\hat{y}(x) = \psi(x)^T w = \psi(x)^T \Psi^T a = k(x)^T (K + \lambda I)^{-1} y$, where $k(x) \in \mathbb{R}^N$ has entries $k_i(x) = k(x^{(i)}, x)$.
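Below is a minimal numpy sketch of this dual (kernel ridge) solution, assuming a Gaussian (RBF) kernel for $k$; the function names and toy data are illustrative choices, not part of the notes.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    # Gaussian (RBF) kernel matrix: K[i, j] = exp(-||a_i - b_j||^2 / (2 l^2))
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * length_scale**2))

def kernel_ridge_fit(X, y, lam=0.1):
    # Dual coefficients: a = (K + lambda I)^{-1} y
    K = rbf_kernel(X, X)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def kernel_ridge_predict(X_train, a, X_new):
    # yhat(x) = k(x)^T a, with k(x)_i = k(x^{(i)}, x)
    return rbf_kernel(X_new, X_train) @ a

# Toy 1-D data (illustrative)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(30, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(30)
a = kernel_ridge_fit(X, y, lam=0.1)
print(kernel_ridge_predict(X, a, np.array([[0.0], [1.0]])))
```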
Some Kernel Functions
- Radial basis functions depend only on the distance from a center $\mu_j$, i.e. $\psi_j(x) = h(\|x - \mu_j\|)$.
- Sigmoidal basis function: $h$ is a sigmoid.
- Gaussian basis function: $h$ is a Gaussian (a small sketch follows this list).
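A brief, hedged sketch of such radial features, with assumed centers $\mu_j$ and assumed widths; it only illustrates the definition $\psi_j(x) = h(\|x - \mu_j\|)$.

```python
import numpy as np

def radial_features(X, centers, h):
    # psi_j(x) = h(||x - mu_j||): one column of the design matrix per center mu_j
    dists = np.sqrt(np.sum((X[:, None, :] - centers[None, :, :])**2, axis=2))
    return h(dists)

gaussian_h = lambda r: np.exp(-r**2 / 2.0)       # Gaussian basis
sigmoid_h = lambda r: 1.0 / (1.0 + np.exp(-r))   # sigmoidal basis

X = np.linspace(-3, 3, 5).reshape(-1, 1)         # toy inputs, D = 1
centers = np.array([[-2.0], [0.0], [2.0]])       # M = 3 centers mu_j (assumed)
Psi = radial_features(X, centers, gaussian_h)    # design matrix Psi, shape (N, M)
print(Psi.shape)
```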
Gaussian Process
Def: a Gaussian process is a probability distribution over functions $\hat{y}(x)$ such that for any $N \ge 1$ and any set of $N$ points $x^{(1)}, \dots, x^{(N)} \in \mathbb{R}^D$, the vector $(\hat{y}(x^{(1)}), \dots, \hat{y}(x^{(N)}))$ has a multivariate Gaussian distribution.
Take $y \mid x \sim \mathcal{N}(\hat{y}(x), \sigma^2)$ with $\hat{y}(x) = w^T \psi(x)$ and a Gaussian prior $w \sim \mathcal{N}(0, \alpha^{-1} I_M)$; then $\hat{y} = \Psi w \sim \mathcal{N}(0, \alpha^{-1}\Psi\Psi^T)$.
Hence $\hat{y}(x)$ is a Gaussian process with mean $0$ and covariance function $\alpha^{-1} k(x, x')$.
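To make the "distribution over functions" idea concrete, here is a hedged sketch that draws sample functions from this zero-mean prior on a grid, assuming an RBF kernel and adding a small jitter for numerical stability (both assumptions, not from the notes).

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * length_scale**2))

alpha = 1.0
xs = np.linspace(-3, 3, 50).reshape(-1, 1)            # grid of input points
cov = rbf_kernel(xs, xs) / alpha + 1e-8 * np.eye(50)  # alpha^{-1} K plus jitter (assumed)
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(np.zeros(50), cov, size=3)
print(samples.shape)  # (3, 50): each row is one sampled function on the grid
```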
Given $y \mid x \sim \mathcal{N}(\hat{y}(x), \sigma^2)$, $\hat{y}(x) = w^T \psi(x)$, and $N$ independent observations, we have $y \mid \hat{y} \sim \mathcal{N}(\hat{y}, \sigma^2 I_N)$ and $\hat{y} \sim \mathcal{N}(0, \alpha^{-1}\Psi\Psi^T)$. The marginal of $y$ is then $y \sim \mathcal{N}(0, C)$ with $C = \alpha^{-1}K + \sigma^2 I_N$, i.e. the corresponding kernel is $C_{ij} = c(x^{(i)}, x^{(j)}) = \alpha^{-1} k(x^{(i)}, x^{(j)}) + \sigma^2 \delta(x^{(i)}, x^{(j)})$.
- Here $\delta(x, x') = 1$ if $x = x'$ and $0$ otherwise.
MLE: $\log p(y \mid \theta) = -\frac{1}{2}\log|C_N| - \frac{1}{2} y^T C_N^{-1} y - \frac{N}{2}\log 2\pi$
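A hedged numpy sketch of this log marginal likelihood, under the assumption that $k$ is an RBF kernel with a length-scale hyperparameter; the toy data and parameter values are illustrative only.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * length_scale**2))

def gp_log_marginal_likelihood(X, y, sigma2=0.1, alpha=1.0, length_scale=1.0):
    # log p(y | theta) = -1/2 log|C_N| - 1/2 y^T C_N^{-1} y - (N/2) log(2 pi)
    N = len(y)
    C = rbf_kernel(X, X, length_scale) / alpha + sigma2 * np.eye(N)  # C_N
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * logdet - 0.5 * y @ np.linalg.solve(C, y) - 0.5 * N * np.log(2 * np.pi)

# Illustrative use: compare two length scales on toy data
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(20, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(20)
print(gp_log_marginal_likelihood(X, y, length_scale=0.5),
      gp_log_marginal_likelihood(X, y, length_scale=2.0))
```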
Example: regression
Denote $y_N = (y^{(1)}, \dots, y^{(N)})$; its marginal is $y_N \sim \mathcal{N}(0, C_N)$ with $C_N = \alpha^{-1}K_N + \sigma^2 I_N$.
Then we can predict the new output $y^{(N+1)}$: $y_{N+1} \sim \mathcal{N}(0, C_{N+1})$ with $C_{N+1} = \alpha^{-1}K_{N+1} + \sigma^2 I_{N+1}$,
which can be partitioned as $C_{N+1} = \begin{pmatrix} C_N & k \\ k^T & c \end{pmatrix}$, where:
- $c = \alpha^{-1} k(x^{(N+1)}, x^{(N+1)}) + \sigma^2$
- $k_i = \alpha^{-1} k(x^{(i)}, x^{(N+1)})$
Therefore, $y^{(N+1)} \mid y_N \sim \mathcal{N}\!\left(k^T C_N^{-1} y_N,\; c - k^T C_N^{-1} k\right)$.
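The predictive mean and variance above can be computed directly; the sketch below assumes an RBF kernel and toy data (illustrative choices, not from the notes).

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * length_scale**2))

def gp_predict(X, y, x_new, sigma2=0.1, alpha=1.0):
    # Predictive mean k^T C_N^{-1} y_N and variance c - k^T C_N^{-1} k
    C_N = rbf_kernel(X, X) / alpha + sigma2 * np.eye(len(X))            # C_N
    k = rbf_kernel(X, x_new.reshape(1, -1)) / alpha                     # k_i = alpha^{-1} k(x^(i), x_new)
    c = rbf_kernel(x_new.reshape(1, -1), x_new.reshape(1, -1)) / alpha + sigma2
    mean = float(k[:, 0] @ np.linalg.solve(C_N, y))
    var = float(c[0, 0] - k[:, 0] @ np.linalg.solve(C_N, k)[:, 0])
    return mean, var

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(20, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(20)
print(gp_predict(X, y, np.array([0.5])))
```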
Example: classification
Place a GP prior over a latent function $a(x)$, then transform it with the sigmoid: $\hat{y}(x) = \sigma(a(x))$, so that $p(y \mid a) = \sigma(a)^y (1 - \sigma(a))^{1-y}$.
Then we have $a_{N+1} \sim \mathcal{N}(0, C_{N+1})$ with $C_{N+1}(x^{(i)}, x^{(j)}) = \alpha^{-1} k(x^{(i)}, x^{(j)}) + \nu\,\delta_{ij}$.
Since the latent values $a_N$ are not directly observed, the predictive distribution is $p(y^{(N+1)} \mid y_N) = \int p(y^{(N+1)} \mid a^{(N+1)})\, p(a^{(N+1)} \mid y_N)\, da^{(N+1)}$.
But this integral is intractable, so we need MCMC-based methods or numerical integration to approximate it.
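As a hedged illustration of the numerical-integration route: if $p(a^{(N+1)} \mid y_N)$ has already been approximated by a Gaussian with mean `mu` and variance `var` (how that Gaussian is obtained is outside this sketch), the remaining 1-D integral can be evaluated with Gauss-Hermite quadrature.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def predictive_prob(mu, var, n_points=30):
    # Approximates  integral of sigmoid(a) * N(a | mu, var) da
    # via Gauss-Hermite quadrature, substituting a = mu + sqrt(2 var) x.
    x, w = np.polynomial.hermite.hermgauss(n_points)
    return float(np.sum(w * sigmoid(mu + np.sqrt(2.0 * var) * x)) / np.sqrt(np.pi))

# Assumed Gaussian approximation to p(a^(N+1) | y_N): mean 1.2, variance 0.8
print(predictive_prob(mu=1.2, var=0.8))  # approx. P(y^(N+1) = 1 | y_N)
```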