📄️ Conditional Independence and Bayes Nets
For a subset of indices $A$, denote the set $x_A = \{x_i : i \in A\}$
📄️ Linear Regression
Recall $Y = f(X) + \epsilon$. If we assume $f$ is linear, then $Y = \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p + \epsilon$, where $f(X) = \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p$ is a linear combination of the predictors $X_1, X_2, \ldots, X_p$. This linear relationship between $X$ and $Y$ is the linear regression model.
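As a minimal sketch (assuming NumPy and synthetic data; the numbers are illustrative, not from the notes), the coefficients $\beta$ can be estimated by least squares:

```python
import numpy as np

# Synthetic data: N observations, p predictors (illustrative values only)
rng = np.random.default_rng(0)
N, p = 100, 3
X = rng.normal(size=(N, p))
true_beta = np.array([2.0, -1.0, 0.5])
y = 1.0 + X @ true_beta + rng.normal(scale=0.1, size=N)

# Add an intercept column and solve the least-squares problem for beta
X_design = np.column_stack([np.ones(N), X])
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print(beta_hat)  # approximately [1.0, 2.0, -1.0, 0.5]
```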
📄️ Markov Random Fields
Markov Blanket (MB): the set of nodes that makes $X_i$ conditionally independent of the other nodes.
📄️ Probabilistic Graphical Models
We introduce the concept of probabilistic graphical models (PGMs) as a probabilistic model for representing the conditional dependence structure between random variables. Some of the most common PGMs are Markov Random Fields and Bayesian Networks.
📄️ Sampling
We have multiple ways to do sampling.
📄️ Hidden Markov Model
In previous courses, and even in the previous lecture, we generally assumed data was i.i.d. for convenience; however, this may be a poor assumption. Many real-life problems involve sequential rather than i.i.d. data. In that case, we make the simplifying assumption that our data can be modeled as a first-order Markov chain: $p(x_t|x_{1:t-1}) = p(x_t|x_{t-1})$.
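A minimal sketch (assuming NumPy and a made-up two-state transition matrix) of sampling a sequence under this first-order Markov assumption:

```python
import numpy as np

# Hypothetical 2-state transition matrix: P[i, j] = p(x_t = j | x_{t-1} = i)
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
rng = np.random.default_rng(0)

# Sample a length-20 sequence: each state depends only on the previous one
states = [0]
for _ in range(19):
    states.append(rng.choice(2, p=P[states[-1]]))
print(states)
```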
📄️ Variational Inference
Recall the posterior distribution $p(z|x) = \frac{p(x,z)}{p(x)}$ is the distribution of the latent variables given the observed data, where $p(x) = \int p(x,z) dz$ is the marginal distribution of the observed data. But generally, when the latent variables are high-dimensional, the posterior distribution becomes intractable to compute. Specifically, we have the following problem:
📄️ Mixture of Gaussians (or Gaussian Mixture Model (GMM))
We use a GMM when a Gaussian latent variable model $p(x) = \sum_z p(x, z)$ is used for clustering.
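A minimal sketch (assuming NumPy/SciPy and made-up mixture parameters) of evaluating the marginal $p(x) = \sum_k \pi_k\, N(x \mid \mu_k, \sigma_k^2)$, i.e., summing out the latent component:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical mixture parameters: weights, means, standard deviations
pi = np.array([0.3, 0.7])
mu = np.array([-2.0, 1.0])
sigma = np.array([0.5, 1.0])

def gmm_density(x):
    # Marginalize the latent component: p(x) = sum_k pi_k * N(x | mu_k, sigma_k^2)
    return np.sum(pi * norm.pdf(x, loc=mu, scale=sigma))

print(gmm_density(0.0))
```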
📄️ Probabilistic Principal Component Analysis
Even when data is very high-dimensional, its important features can often be accurately captured in a low-dimensional subspace. That is why we use PCA.
📄️ Bayesian Linear Regression
BLR is used when a Gaussian discriminative model $p(y|X)$ is used for regression, with a Bayesian analysis of the weights.
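As a sketch of the standard conjugate result (assuming a zero-mean isotropic Gaussian prior $w \sim N(0, \alpha^{-1} I)$ and Gaussian noise with variance $\sigma^2$, both illustrative choices not fixed by these notes), the posterior over the weights is itself Gaussian:

$$p(w \mid X, y) = N(w \mid m, S), \qquad S = \left(\alpha I + \sigma^{-2} X^T X\right)^{-1}, \qquad m = \sigma^{-2} S X^T y.$$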
📄️ Kernel Method
Define a feature map $\psi(x): \R^D \to \R^M$, input data $X \in \R^{N \times D}$, and $\Psi \in \R^{N \times M}$, where $\Psi = \psi(X)$ is the feature map applied row-wise. Then the vector of predictions is $\hat{y} = \Psi w$, and $y|x \sim N(w^T\psi(x), \sigma^2)$.
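A minimal sketch (assuming NumPy and a hypothetical polynomial feature map $\psi$; the weights are made up) of forming $\Psi$ and the prediction $\hat{y} = \Psi w$:

```python
import numpy as np

def psi(x):
    # Hypothetical feature map R^1 -> R^3: [1, x, x^2]
    return np.array([1.0, x[0], x[0] ** 2])

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 1))            # N x D input data
Psi = np.array([psi(x) for x in X])    # N x M design matrix, psi applied row-wise
w = np.array([0.5, 1.0, -0.25])        # example weights in R^M
y_hat = Psi @ w                        # predictions: hat{y} = Psi w
print(y_hat)
```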
📄️ Basic Information to Multivariate Data
For multivariate data, we have $p$ variables, where $p \ge 2$, and $n$ observations (items/experimental units). We denote by $x_{jk}$ the measurement of the $k$th variable on the $j$th item or experimental unit.
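Collecting these measurements into the usual $n \times p$ data matrix (a sketch using the $x_{jk}$ notation above):

$$X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix}$$

where row $j$ holds the measurements on item $j$ and column $k$ holds variable $k$.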
📄️ Moving beyond Linearity
We often make the linearity assumption because it makes our lives easier. However, the linearity assumption is not always a good approximation, and is sometimes a poor one. So we extend the linear model through its features.
📄️ Classification
We may have a classification problem where the response is qualitative, taking values in an unordered set $C$. Our main goal is to:
📄️ Decision Tree
Decision Tree is a supervised learning algorithm that can be used for both classification and regression problems. It is a tree-like structure where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node holds a class label. The paths from root to leaf represent classification rules. Generally, a decision tree has high variance and low bias.
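A minimal sketch (assuming scikit-learn and a tiny made-up dataset) of fitting and applying such a tree:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Tiny illustrative dataset: two features, binary class labels
X = np.array([[1.0, 2.0], [2.0, 1.0], [6.0, 5.0], [7.0, 8.0]])
y = np.array([0, 0, 1, 1])

# Each internal node tests one feature against a threshold; leaves hold class labels
clf = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(clf.predict([[1.5, 1.5], [6.5, 7.0]]))  # -> [0 1]
```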
📄️ Discriminant Analysis
Discriminant Analysis parametrizes the distributions of $X | Y = 1$ and $X | Y = 0$.
📄️ Fitted Model Measurement
To measure the fit of a model, we need to compare the model's predictions with the actual data. The most common way to do this is to use the mean squared error (MSE) or the root mean squared error (RMSE). Depending on the type of data, we can use different measures of fit.
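A minimal sketch (assuming NumPy; the observed values and predictions are made up) of computing both measures:

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])   # illustrative observed values
y_pred = np.array([2.5, 0.0, 2.0, 8.0])    # illustrative model predictions

mse = np.mean((y_true - y_pred) ** 2)  # mean squared error
rmse = np.sqrt(mse)                    # root mean squared error
print(mse, rmse)
```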
📄️ Gradient Descent
Gradient descent is an iterative optimization algorithm for finding the minimum of a function. To find a local minimum, it follows the derivative (gradient) downhill toward a critical point; when the function is convex, this local minimum is also the global minimum. For example, we can use gradient descent to minimize the MSE.
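A minimal sketch (assuming NumPy, synthetic data, and a made-up learning rate) of gradient descent on the MSE of a linear model:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([1.5, -2.0]) + rng.normal(scale=0.1, size=100)

w = np.zeros(2)
lr = 0.1  # hypothetical step size
for _ in range(500):
    grad = 2 / len(y) * X.T @ (X @ w - y)  # gradient of the MSE w.r.t. w
    w -= lr * grad                         # step in the negative gradient direction
print(w)  # approximately [1.5, -2.0]
```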
📄️ Lasso Regression
The key point of Lasso Regression is to shrink the coefficients toward 0 by penalizing their absolute values, i.e., to find a model that minimizes $\left[\sum_{i=1}^n \left(y_i - \beta_0 - \sum_{j = 1}^p\beta_j x_{ij}\right)^2\right] + \lambda\sum_{j = 1}^p|\beta_j| = RSS + \lambda\sum_{j = 1}^p|\beta_j|$
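A minimal sketch (assuming scikit-learn is available; the penalty $\lambda$ is called `alpha` there, and the value used is illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, 0.0, 0.0, -2.0, 0.0]) + rng.normal(scale=0.1, size=100)

# The L1 penalty shrinks coefficients toward 0 and can set some exactly to 0
model = Lasso(alpha=0.1).fit(X, y)
print(model.coef_)       # sparse: irrelevant coefficients driven to (near) 0
print(model.intercept_)
```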
📄️ Logistic Regression
Logistic regression is a parametric approach to classification. It models the probability as a function of $x$ by $p(X) =\frac{e^{\beta_0 + \beta X}}{1+e^{\beta_0 + \beta X}}$, where $\beta_0$ is the intercept and $\beta$ is the coefficient vector, and $\frac{p(X)}{1-p(X)} = e^{\beta_0 + \beta X}$ is the odds.
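A minimal sketch (assuming NumPy and made-up coefficients) of computing $p(X)$ and the odds for one input:

```python
import numpy as np

beta0 = -1.0                 # hypothetical intercept
beta = np.array([2.0, 0.5])  # hypothetical coefficient vector
x = np.array([0.3, -1.2])

eta = beta0 + beta @ x
p = np.exp(eta) / (1 + np.exp(eta))  # p(X) = e^eta / (1 + e^eta)
odds = p / (1 - p)                   # equals e^eta
print(p, odds, np.exp(eta))
```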
📄️ Machine Learning
Machine learning is a subset of artificial intelligence; it is the study of computer algorithms that improve automatically through experience. Machine learning algorithms build a mathematical model based on sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to perform the task.
📄️ Model Selection
The approach we typically use is Subset Selection: we identify a subset of the $p$ predictors that we believe to be related to the response.
📄️ Multivariate Data Analysis among Machine Learning
Linear Regression
📄️ Multivariate Normal Distribution
Multivariate Normal Distribution is a generalization of the normal distribution to multiple dimensions. It is often a good approximation to the true distribution because, by the Central Limit Theorem, the multivariate normal arises as the approximate sampling distribution of many multivariate statistics.
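A minimal sketch (assuming SciPy and a made-up mean vector and covariance matrix) of evaluating and sampling a multivariate normal:

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 1.0])               # hypothetical mean vector
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])          # hypothetical covariance matrix

dist = multivariate_normal(mean=mu, cov=Sigma)
print(dist.pdf([0.5, 0.5]))              # density at a point
print(dist.rvs(size=3, random_state=0))  # three samples from N(mu, Sigma)
```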
📄️ Some recap of previous courses
Sufficient Statistics
📄️ Ridge Regression
The key point of Ridge Regression is to find a model that minimizes $\left[\sum_{i=1}^n \left(y_i - \beta_0 - \sum_{j = 1}^p\beta_j x_{ij}\right)^2\right] + \lambda\sum_{j = 1}^p\beta_j^2 = RSS + \lambda\sum_{j = 1}^p\beta_j^2$, which shrinks the coefficients toward 0.
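A minimal sketch (assuming NumPy, an illustrative $\lambda$, and centered data so the intercept can be ignored) of the closed-form ridge solution $\hat\beta = (X^TX + \lambda I)^{-1}X^Ty$:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = X @ np.array([1.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=100)

lam = 1.0  # illustrative penalty strength
# Ridge estimate: beta_hat = (X^T X + lambda I)^{-1} X^T y
beta_hat = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
print(beta_hat)  # coefficients shrunk toward 0 relative to OLS
```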
📄️ Situation without the Test Data to do Model Validation
There are two common approaches for model selection when we don't have $D_{\text{test}}$
📄️ Support Vector Machine
The training points for which the constraint $y_i(x_i^Tw + b) \ge M$ holds with equality are called support vectors. The Support Vector Machine (SVM) is a classifier that finds the optimal hyperplane separating the classes; SVM-like algorithms are often called max-margin or large-margin. Since the primal formulation is convex (specifically, a quadratic program), we can use SGD/GD to solve it, but it is more common to solve the dual formulation.
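A minimal sketch (assuming scikit-learn and made-up two-class data; `SVC` is backed by a dual-formulation solver, and the large `C` here only approximates a hard margin):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=-2, size=(20, 2)), rng.normal(loc=2, size=(20, 2))])
y = np.array([0] * 20 + [1] * 20)

# Linear max-margin classifier on separable data
clf = SVC(kernel="linear", C=1e3).fit(X, y)
print(clf.support_vectors_)        # the training points that define the margin
print(clf.predict([[0.0, 0.5]]))
```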
📄️ Tree Improving
There are many ways to improve a tree, such as pruning, boosting, and bagging. In this section, we introduce bagging, random forests, and boosting.
📄️ Some Unsupervised Learning Model
Unsupervised learning is the study of data without labels; typically, it is the task of grouping, explaining, and finding structure in data.
📄️ STA414 Statistical Methods for Machine Learning II
Instructor: Piotr Zwiernik, Murat A. Erdogdu
📄️ STA437 Method for Multivariate Data Analysis
Instructor: Mehdi Molkaraie