📄️ Conditional Independence and Bayes Nets
For a subset of indices $A$, denote the set $x_A = \{x_i : i \in A\}$
📄️ Linear Regression
Recall $Y = f(X) + \epsilon$. If we assume $f$ is linear, then $Y = \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p + \epsilon$, where $f(X) = \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p$ is a linear combination of the predictors $X_1, X_2, \ldots, X_p$. This linear relationship between $X$ and $Y$ is the linear regression model.
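As a minimal sketch (assuming NumPy and synthetic data; the numbers are illustrative, not from the notes), the coefficients $\beta$ can be estimated by least squares:

```python
import numpy as np

# Synthetic data: N observations, p predictors (illustrative values only)
rng = np.random.default_rng(0)
N, p = 100, 3
X = rng.normal(size=(N, p))
true_beta = np.array([2.0, -1.0, 0.5])
y = 1.0 + X @ true_beta + rng.normal(scale=0.1, size=N)

# Add an intercept column and solve the least-squares problem for beta
X_design = np.column_stack([np.ones(N), X])
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print(beta_hat)  # approximately [1.0, 2.0, -1.0, 0.5]
```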
📄️ Markov Random Fields
Markov Blanket (MB): the set of nodes that makes $X_i$ conditionally independent of the other nodes.
📄️ Probabilistic Graphical Models
We introduce the concept of probabilistic graphical models (PGMs) as a probabilistic model for representing the conditional dependence structure between random variables. Some of the most common PGMs are Markov Random Fields and Bayesian Networks.
📄️ Sampling
We have multiple ways to do sampling.
📄️ Hidden Markov Model
In previous courses, and even in the previous lecture, we generally assumed data was i.i.d. for convenience; however, this may be a poor assumption. Many real-life problems involve sequential rather than i.i.d. data. In that case, we make the simplifying assumption that our data can be modeled as a first-order Markov chain: $p(x_t|x_{1:t-1}) = p(x_t|x_{t-1})$.
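A minimal sketch (assuming NumPy and a made-up two-state transition matrix) of sampling a sequence under this first-order Markov assumption:

```python
import numpy as np

# Hypothetical 2-state transition matrix: P[i, j] = p(x_t = j | x_{t-1} = i)
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
rng = np.random.default_rng(0)

# Sample a length-20 sequence: each state depends only on the previous one
states = [0]
for _ in range(19):
    states.append(rng.choice(2, p=P[states[-1]]))
print(states)
```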
📄️ Variational Inference
Recall the posterior distribution $p(z|x) = \frac{p(x,z)}{p(x)}$ is the distribution of the latent variables given the observed data, where $p(x) = \int p(x,z) dz$ is the marginal distribution of the observed data. But generally, when the latent variables are high-dimensional, the posterior distribution becomes intractable to compute. Specifically, we have the following problem:
📄️ Mixture of Gaussians (or Gaussian Mixture Model (GMM))
We use a GMM when a Gaussian latent variable model $p(x) = \sum_z p(x, z)$ is used for clustering.
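A minimal sketch (assuming NumPy/SciPy and made-up mixture parameters) of evaluating the marginal $p(x) = \sum_k \pi_k\, N(x \mid \mu_k, \sigma_k^2)$, i.e., summing out the latent component:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical mixture parameters: weights, means, standard deviations
pi = np.array([0.3, 0.7])
mu = np.array([-2.0, 1.0])
sigma = np.array([0.5, 1.0])

def gmm_density(x):
    # Marginalize the latent component: p(x) = sum_k pi_k * N(x | mu_k, sigma_k^2)
    return np.sum(pi * norm.pdf(x, loc=mu, scale=sigma))

print(gmm_density(0.0))
```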
📄️ Probabilistic Principal Component Analysis
Even when data is very high-dimensional, its important features can often be accurately captured in a low-dimensional subspace. That is why we use PCA.
📄️ Bayesian Linear Regression
BLR is used when a Gaussian discriminative model $p(y|X)$ is used for regression, with a Bayesian analysis of the weights.
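As a sketch of the standard conjugate result (assuming a zero-mean isotropic Gaussian prior $w \sim N(0, \alpha^{-1} I)$ and Gaussian noise with variance $\sigma^2$, both illustrative choices not fixed by these notes), the posterior over the weights is itself Gaussian:

$$p(w \mid X, y) = N(w \mid m, S), \qquad S = \left(\alpha I + \sigma^{-2} X^T X\right)^{-1}, \qquad m = \sigma^{-2} S X^T y.$$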
📄️ Kernel Method
Define a feature map $\psi(x): \R^D \to \R^M$, input data $X \in \R^{N \times D}$, and $\Psi \in \R^{N \times M}$, where $\Psi = \psi(X)$ is the feature map applied row-wise. Then the vector of predictions is $\hat{y} = \Psi w$, and $y|x \sim N(w^T\psi(x), \sigma^2)$.
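A minimal sketch (assuming NumPy and a hypothetical polynomial feature map $\psi$; the weights are made up) of forming $\Psi$ and the prediction $\hat{y} = \Psi w$:

```python
import numpy as np

def psi(x):
    # Hypothetical feature map R^1 -> R^3: [1, x, x^2]
    return np.array([1.0, x[0], x[0] ** 2])

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 1))            # N x D input data
Psi = np.array([psi(x) for x in X])    # N x M design matrix, psi applied row-wise
w = np.array([0.5, 1.0, -0.25])        # example weights in R^M
y_hat = Psi @ w                        # predictions: hat{y} = Psi w
print(y_hat)
```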
📄️ Basic Information to Multivariate Data
For multivariate data, we have $p$ variables, where $p \ge 2$, and $n$ observations (items/experimental units). We denote by $x_{jk}$ the measurement of the $k$th variable on the $j$th item or experimental unit.
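Collecting these measurements into the usual $n \times p$ data matrix (a sketch using the $x_{jk}$ notation above):

$$X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix}$$

where row $j$ holds the measurements on item $j$ and column $k$ holds variable $k$.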
📄️ Moving beyond Linearity
We often make the linearity assumption because it makes our lives easier. However, the linearity assumption is not always a good approximation, and is sometimes a poor one. So we extend the linear model through its features.
📄️ Classification
We may have a classification problem where the response is qualitative, taking values in an unordered set $C$. Our main goal is to:
📄️ Decision Tree
Decision Tree is a supervised learning algorithm that can be used for both classification and regression problems. It is a tree-like structure where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node holds a class label. The paths from root to leaf represent classification rules. Generally, a decision tree has high variance and low bias.
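A minimal sketch (assuming scikit-learn and a tiny made-up dataset) of fitting and applying such a tree:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Tiny illustrative dataset: two features, binary class labels
X = np.array([[1.0, 2.0], [2.0, 1.0], [6.0, 5.0], [7.0, 8.0]])
y = np.array([0, 0, 1, 1])

# Each internal node tests one feature against a threshold; leaves hold class labels
clf = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(clf.predict([[1.5, 1.5], [6.5, 7.0]]))  # -> [0 1]
```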
📄️ Discriminant Analysis
Discriminant Analysis parametrizes the distributions of $X | Y = 1$ and $X | Y = 0$.
📄️ Fitted Model Measurement
To measure the fit of a model, we need to compare the model's predictions with the actual data. The most common way to do this is to use the mean squared error (MSE) or the root mean squared error (RMSE). Depending on the type of data, we can use different measures of fit.
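A minimal sketch (assuming NumPy; the observed values and predictions are made up) of computing both measures:

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])   # illustrative observed values
y_pred = np.array([2.5, 0.0, 2.0, 8.0])    # illustrative model predictions

mse = np.mean((y_true - y_pred) ** 2)  # mean squared error
rmse = np.sqrt(mse)                    # root mean squared error
print(mse, rmse)
```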
📄️ Gradient Descent
Gradient descent is an iterative optimization algorithm for finding the minimum of a function. To find a local minimum, it follows the derivative (gradient) downhill toward a critical point; when the function is convex, this local minimum is also the global minimum. For example, we can use gradient descent to minimize the MSE.
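A minimal sketch (assuming NumPy, synthetic data, and a made-up learning rate) of gradient descent on the MSE of a linear model:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([1.5, -2.0]) + rng.normal(scale=0.1, size=100)

w = np.zeros(2)
lr = 0.1  # hypothetical step size
for _ in range(500):
    grad = 2 / len(y) * X.T @ (X @ w - y)  # gradient of the MSE w.r.t. w
    w -= lr * grad                         # step in the negative gradient direction
print(w)  # approximately [1.5, -2.0]
```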
📄️ Lasso Regression
The key point of Lasso Regression is to shrink the coefficients toward 0 by penalizing their absolute values, i.e., to find a model that minimizes $\left[\sum_{i=1}^n \left(y_i - \beta_0 - \sum_{j = 1}^p\beta_j x_{ij}\right)^2\right] + \lambda\sum_{j = 1}^p|\beta_j| = RSS + \lambda\sum_{j = 1}^p|\beta_j|$
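A minimal sketch (assuming scikit-learn is available; the penalty $\lambda$ is called `alpha` there, and the value used is illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, 0.0, 0.0, -2.0, 0.0]) + rng.normal(scale=0.1, size=100)

# The L1 penalty shrinks coefficients toward 0 and can set some exactly to 0
model = Lasso(alpha=0.1).fit(X, y)
print(model.coef_)       # sparse: irrelevant coefficients driven to (near) 0
print(model.intercept_)
```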
📄️ Logistic Regression
Logistic regression is a parametric approach to classification. It models the probability as a function of $x$ by $p(X) =\frac{e^{\beta_0 + \beta X}}{1+e^{\beta_0 + \beta X}}$, where $\beta_0$ is the intercept and $\beta$ is the coefficient vector, and $\frac{p(X)}{1-p(X)} = e^{\beta_0 + \beta X}$ is the odds.
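A minimal sketch (assuming NumPy and made-up coefficients) of computing $p(X)$ and the odds for one input:

```python
import numpy as np

beta0 = -1.0                 # hypothetical intercept
beta = np.array([2.0, 0.5])  # hypothetical coefficient vector
x = np.array([0.3, -1.2])

eta = beta0 + beta @ x
p = np.exp(eta) / (1 + np.exp(eta))  # p(X) = e^eta / (1 + e^eta)
odds = p / (1 - p)                   # equals e^eta
print(p, odds, np.exp(eta))
```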
📄️ Machine Learning
Machine learning is a subset of artificial intelligence; it is the study of computer algorithms that improve automatically through experience. Machine learning algorithms build a mathematical model based on sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to perform the task.
📄️ Model Selection
The approach we typically use is Subset Selection: we identify a subset of the $p$ predictors that we believe to be related to the response.
📄️ Multivariate Data Analysis among Machine Learning
Linear Regression
📄️ Multivariate Normal Distribution
Multivariate Normal Distribution is a generalization of the normal distribution to multiple dimensions. It is often a good approximation to the true distribution because, by the Central Limit Theorem, the multivariate normal arises as the approximate sampling distribution of many multivariate statistics.
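A minimal sketch (assuming SciPy and a made-up mean vector and covariance matrix) of evaluating and sampling a multivariate normal:

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 1.0])               # hypothetical mean vector
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])          # hypothetical covariance matrix

dist = multivariate_normal(mean=mu, cov=Sigma)
print(dist.pdf([0.5, 0.5]))              # density at a point
print(dist.rvs(size=3, random_state=0))  # three samples from N(mu, Sigma)
```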
📄️ Some recap of previous courses
Sufficient Statistics
📄️ Ridge Regression
The key point of Ridge Regression is to find a model that minimizes $\left[\sum_{i=1}^n \left(y_i - \beta_0 - \sum_{j = 1}^p\beta_j x_{ij}\right)^2\right] + \lambda\sum_{j = 1}^p\beta_j^2 = RSS + \lambda\sum_{j = 1}^p\beta_j^2$, which shrinks the coefficients toward 0.
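A minimal sketch (assuming NumPy, an illustrative $\lambda$, and centered data so the intercept can be ignored) of the closed-form ridge solution $\hat\beta = (X^TX + \lambda I)^{-1}X^Ty$:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = X @ np.array([1.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=100)

lam = 1.0  # illustrative penalty strength
# Ridge estimate: beta_hat = (X^T X + lambda I)^{-1} X^T y
beta_hat = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
print(beta_hat)  # coefficients shrunk toward 0 relative to OLS
```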
📄️ Situation without the Test Data to do Model Validation
There are two common approaches for model selection when we don't have $D_{\text{test}}$
📄️ Support Vector Machine
The training points for which the constraint $y_i(x_i^Tw + b) \ge M$ holds with equality are called support vectors. The Support Vector Machine (SVM) is a classifier that finds the optimal hyperplane separating the classes; SVM-like algorithms are often called max-margin or large-margin. Since the primal formulation is convex (specifically, a quadratic program), we can use SGD/GD to solve it, but it is more common to solve the dual formulation.
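A minimal sketch (assuming scikit-learn and made-up two-class data; `SVC` is backed by a dual-formulation solver, and the large `C` here only approximates a hard margin):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=-2, size=(20, 2)), rng.normal(loc=2, size=(20, 2))])
y = np.array([0] * 20 + [1] * 20)

# Linear max-margin classifier on separable data
clf = SVC(kernel="linear", C=1e3).fit(X, y)
print(clf.support_vectors_)        # the training points that define the margin
print(clf.predict([[0.0, 0.5]]))
```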
📄️ Tree Improving
There are many ways to improve a tree, such as pruning, boosting, and bagging. In this section, we introduce bagging, random forests, and boosting.
📄️ Some Unsupervised Learning Model
Unsupervised learning is the study of data without labels; typically, it is the task of grouping, explaining, and finding structure in data.
📄️ STA414 Statistical Methods for Machine Learning II
Instructor: Piotr Zwiernik, Murat A. Erdogdu
📄️ STA437 Method for Multivariate Data Analysis
Instructor: Mehdi Molkaraie