Bayesian Linear Regression

BLR uses a Gaussian discriminative model $p(y|X)$ for regression, combined with a Bayesian analysis of the weights.

Recall that if a random variable $x \sim \mathcal{N}(\mu, \Sigma)$, then the log probability density function is $\log p(x) = -\frac{1}{2}(x - \mu)^{\top}\Sigma^{-1}(x - \mu) + \text{const} = -\frac{1}{2}x^{\top}\Sigma^{-1}x + x^{\top}\Sigma^{-1}\mu + \text{const}$. That is, if we know some random variable $w$ follows a Gaussian distribution with $\log p(w) = -\frac{1}{2}w^{\top}Aw + w^{\top}b + \text{const}$, then $w \sim \mathcal{N}(A^{-1}b, A^{-1})$.
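
As a quick numerical check of this "read off the mean and covariance" trick (toy values, not from the text), we can start from a known $\mu$ and $\Sigma$, form $A = \Sigma^{-1}$ and $b = \Sigma^{-1}\mu$, and recover the parameters from $(A, b)$:

```python
import numpy as np

# Toy check of the identity: if log p(w) = -1/2 w^T A w + w^T b + const,
# then w ~ N(A^{-1} b, A^{-1}). Values below are illustrative only.
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])

# Expanding -1/2 (w - mu)^T Sigma^{-1} (w - mu) gives A = Sigma^{-1}, b = Sigma^{-1} mu.
A = np.linalg.inv(Sigma)
b = A @ mu

print(np.linalg.solve(A, b))  # recovers mu
print(np.linalg.inv(A))       # recovers Sigma
```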

Consider $y|x \sim \mathcal{N}(w^{\top}\psi(x), \sigma^2)$; then

the log likelihood for linear regression is $\log p(y|x) = \sum_{i=1}^N \log p(y_i|x_i) = \sum_{i=1}^N \log \mathcal{N}(y_i \mid w^{\top}\psi(x_i), \sigma^2) = -\frac{N}{2}\log 2\pi - \frac{N}{2}\log \sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^N (y_i - w^{\top}\psi(x_i))^2 = \text{const} - \frac{1}{2\sigma^2}\|y - \Psi w\|^2$.
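
A small sketch with made-up data (so `Psi` is just an arbitrary matrix standing in for the feature matrix) confirming that the point-wise and vectorized forms of the log likelihood agree:

```python
import numpy as np
from scipy.stats import norm

# Toy data: Psi plays the role of the feature matrix with rows psi(x_i)^T.
rng = np.random.default_rng(0)
N, d = 50, 3
Psi = rng.normal(size=(N, d))
sigma = 0.3
y = Psi @ np.array([0.5, -1.0, 2.0]) + sigma * rng.normal(size=N)

w = np.array([0.4, -0.9, 1.8])  # some candidate weights

# Sum of per-point Gaussian log densities ...
ll_pointwise = norm.logpdf(y, loc=Psi @ w, scale=sigma).sum()
# ... equals the vectorized closed form above.
ll_vectorized = (-N / 2 * np.log(2 * np.pi) - N / 2 * np.log(sigma**2)
                 - np.sum((y - Psi @ w) ** 2) / (2 * sigma**2))
print(np.allclose(ll_pointwise, ll_vectorized))  # True
```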

The MAP estimate for regularized linear regression is $\arg\max_{w} \log p(w|D) = \arg\max_{w} \left[\log p(w) + \log p(D|w)\right]$, where (see the sketch after this list):

  • $\log p(D|w) = \text{const} - \frac{1}{2\sigma^2}\|y - \Psi w\|^2$
  • assume a prior $w \sim \mathcal{N}(m, S)$; then $\log p(w) = -\frac{1}{2}(w - m)^{\top}S^{-1}(w - m) + \text{const}$
    • Commonly, $m = 0$ and $S = \eta I$, so $\log p(w) = -\frac{1}{2\eta}\|w\|^2 + \text{const}$
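
Putting the two terms together with $m = 0$ and $S = \eta I$, maximizing gives the ridge solution $w_{\text{MAP}} = (\Psi^{\top}\Psi + \frac{\sigma^2}{\eta}I)^{-1}\Psi^{\top}y$. A minimal sketch (function and variable names are my own):

```python
import numpy as np

def map_estimate(Psi, y, sigma2, eta):
    """MAP weights for the prior w ~ N(0, eta * I): ridge with lam = sigma^2 / eta."""
    d = Psi.shape[1]
    lam = sigma2 / eta
    return np.linalg.solve(Psi.T @ Psi + lam * np.eye(d), Psi.T @ y)
```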

Rather than stopping at a point estimate, the full Bayesian treatment uses the posterior predictive distribution: $p(y|x, D) = \int p(w|D)\, p(y|x, w)\, dw$.
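
One way to read this integral is as an expectation over the posterior, which suggests a Monte Carlo approximation. A hedged sketch; `post_mean` and `post_cov` are placeholders for the posterior parameters derived in the next section:

```python
import numpy as np
from scipy.stats import norm

def predictive_density_mc(y, psi_x, post_mean, post_cov, sigma2, M=10_000, seed=0):
    """Approximate p(y | x, D) = E_{w ~ p(w|D)}[p(y | x, w)] by averaging over posterior samples."""
    rng = np.random.default_rng(seed)
    w_samples = rng.multivariate_normal(post_mean, post_cov, size=M)  # w_m ~ p(w|D)
    return norm.pdf(y, loc=w_samples @ psi_x, scale=np.sqrt(sigma2)).mean()
```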

BLR assumptions and steps

BLR assumes the following (a toy setup sketch follows the list):

  • Prior: $w \sim \mathcal{N}(0, S)$
  • Likelihood: $y|x, w \sim \mathcal{N}(w^{\top}\psi(x), \sigma^2)$
  • $S$ and $\sigma^2$ are known/fixed
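
A toy setup under these assumptions (the polynomial feature map and all numerical values are my own choices for illustration):

```python
import numpy as np

def psi(x, degree=3):
    """Feature map psi(x) = [1, x, x^2, ..., x^degree] for each input."""
    return np.vander(np.atleast_1d(x), degree + 1, increasing=True)

eta, sigma2 = 1.0, 0.1
S = eta * np.eye(4)                       # prior covariance, S = eta * I

rng = np.random.default_rng(0)
x_train = rng.uniform(-1, 1, size=30)
y_train = np.sin(2 * x_train) + np.sqrt(sigma2) * rng.normal(size=30)
Psi = psi(x_train)                        # design matrix with rows psi(x_i)^T
```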

Continuing with the posterior of regularized linear regression, we have $\log p(w|D) = \log p(w) + \log p(D|w) + \text{const} = -\frac{1}{2}w^{\top}S^{-1}w - \frac{1}{2\sigma^2}\|y - \Psi w\|^2 + \text{const} = -\frac{1}{2}w^{\top}(\sigma^{-2}\Psi^{\top}\Psi + S^{-1})w + \sigma^{-2}y^{\top}\Psi w + \text{const}$.

A Gaussian prior leads to a Gaussian posterior, so the Gaussian distribution is the conjugate prior for the linear regression model. Reading the mean and covariance off the quadratic form above, $w|D \sim \mathcal{N}\left((\sigma^{2}S^{-1} + \Psi^{\top}\Psi)^{-1}\Psi^{\top}y,\; \sigma^{2}(\Psi^{\top}\Psi + \sigma^{2}S^{-1})^{-1}\right)$.
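
A minimal sketch of computing these posterior parameters (function and variable names are my own):

```python
import numpy as np

def posterior(Psi, y, S, sigma2):
    """Posterior w | D ~ N(m_N, S_N) for the prior w ~ N(0, S) and noise variance sigma2."""
    A = Psi.T @ Psi + sigma2 * np.linalg.inv(S)
    m_N = np.linalg.solve(A, Psi.T @ y)   # (sigma^2 S^{-1} + Psi^T Psi)^{-1} Psi^T y
    S_N = sigma2 * np.linalg.inv(A)       # sigma^2 (Psi^T Psi + sigma^2 S^{-1})^{-1}
    return m_N, S_N
```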

Recall the closed-form solution of regularized linear regression, $w = (\Psi^{\top}\Psi + \lambda I)^{-1}\Psi^{\top}y$; this matches the posterior mean when $S = \frac{\sigma^2}{\lambda}I$. From this, we can observe that as $\lambda \to 0$ (i.e. no regularization), the posterior mean converges to the MLE solution for linear regression.
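
A quick check with toy numbers that the posterior mean under $S = \frac{\sigma^2}{\lambda}I$ coincides with the regularized closed-form solution:

```python
import numpy as np

rng = np.random.default_rng(1)
Psi = rng.normal(size=(40, 4))
y = rng.normal(size=40)
sigma2, lam = 0.25, 0.5
S = (sigma2 / lam) * np.eye(4)            # prior implied by regularization strength lam

ridge = np.linalg.solve(Psi.T @ Psi + lam * np.eye(4), Psi.T @ y)
post_mean = np.linalg.solve(Psi.T @ Psi + sigma2 * np.linalg.inv(S), Psi.T @ y)
print(np.allclose(ridge, post_mean))      # True
```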

Since $p(y|x,D) = \int p(w|D)\, p(y|x,w)\, dw$ and both factors are Gaussian, $y|x,D \sim \mathcal{N}\left(\left((\sigma^{2}S^{-1} + \Psi^{\top}\Psi)^{-1}\Psi^{\top}y\right)^{\top}\psi(x),\; \psi(x)^{\top}\sigma^{2}(\Psi^{\top}\Psi + \sigma^{2}S^{-1})^{-1}\psi(x) + \sigma^{2}\right)$, as sketched in code after the bullet below.

  • Alternatively, write $y = w^{\top}\psi(x) + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma^2)$ and plug in the posterior $w|D$ and the noise $\epsilon$; a linear combination of independent Gaussians is again Gaussian.
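
Either route gives the same closed form. A sketch of the predictive mean and variance, reusing the posterior parameters $m_N$, $S_N$ from the `posterior` sketch above:

```python
import numpy as np

def predictive(psi_x, m_N, S_N, sigma2):
    """Posterior predictive y | x, D ~ N(m_N^T psi(x), psi(x)^T S_N psi(x) + sigma2)."""
    mean = m_N @ psi_x
    var = psi_x @ S_N @ psi_x + sigma2    # weight uncertainty plus observation noise
    return mean, var
```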