\documentclass[12pt]{article}
%\usepackage[landscape]{geometry}
\usepackage[landscape,hmargin=2cm,vmargin=1.5cm,headsep=0cm]{geometry}
% See geometry.pdf to learn the layout options. There are lots.
\geometry{a4paper} % ... or a4paper or a5paper or ...
%\geometry{landscape} % Activate for rotated page geometry
%\usepackage[parfill]{parskip} % Activate to begin paragraphs with an empty line rather than an indent
\usepackage{hyperref}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{epstopdf}
\usepackage{framed}
\usepackage{multicol}
\usepackage{graphicx}
\usepackage{float}
\setkeys{Gin}{width=0.25\textwidth}
\usepackage[table]{xcolor}
\newcommand\x{\times}
\newcommand\y{\cellcolor{green!10}}
\newcommand{\pder}[2][]{\frac{\partial#1}{\partial#2}}
\newcommand{\argmin}{\arg\!\min}
\newcommand{\argmax}{\arg\!\max}
\newtheorem{definition}{Definition}
\newtheorem{theorem}{Theorem}
\newtheorem{fact}{Fact}
\newtheorem{proposition}{Proposition}
% Turn off header and footer
\pagestyle{plain}
% Redefine section commands to use less space
\makeatletter
\renewcommand{\section}{\@startsection{section}{1}{0mm}%
{-1ex plus -.5ex minus -.2ex}%
{0.5ex plus .2ex}%
{\normalfont\large\bfseries}}
\renewcommand{\subsection}{\@startsection{subsection}{2}{0mm}%
{-1ex plus -.5ex minus -.2ex}%
{0.5ex plus .2ex}%
{\normalfont\normalsize\bfseries}}
\renewcommand{\subsubsection}{\@startsection{subsubsection}{3}{0mm}%
{-1ex plus -.5ex minus -.2ex}%
{1ex plus .2ex}%
{\normalfont\small\bfseries}}
\makeatother
% Define BibTeX command
\def\BibTeX{{\rm B\kern-.05em{\sc i\kern-.025em b}\kern-.08em
T\kern-.1667em\lower.7ex\hbox{E}\kern-.125emX}}
% Don't print section numbers
%\setcounter{secnumdepth}{0}
\newcommand{\eps}{\epsilon}
\newcommand{\al}{\alpha}
\setlength{\parindent}{0pt}
\setlength{\parskip}{0pt plus 0.5ex}
\usepackage{Sweave}
\DeclareGraphicsRule{.tif}{png}{.png}{`convert #1 `dirname #1`/`basename #1 .tif`.png}
\usepackage{cancel}
%% taken from http://brunoj.wordpress.com/2009/10/08/latex-the-framed-minipage/
\newsavebox{\fmbox}
\newenvironment{fmpage}[1]
{\begin{lrbox}{\fmbox}\begin{minipage}{#1}}
{\end{minipage}\end{lrbox}\fbox{\usebox{\fmbox}}}
\usepackage{mathtools}
\makeatletter
\newcommand{\explain}[2]{\underset{\mathclap{\overset{\uparrow}{#2}}}{#1}}
\newcommand{\explainup}[2]{\overset{\mathclap{\underset{\downarrow}{#2}}}{#1}}
\makeatother
\SweaveOpts{prefix.string=MatAlgfigs/MatAlgfig}
\SweaveOpts{cache=TRUE}
\title{Multivariate Analysis Summary Sheet (Part 1)}
\author{Shravan Vasishth ([email protected])}
\date{\today} % Activate to display a given date or no date
\newcommand\myfigure[1]{%
\medskip\noindent\begin{minipage}{\columnwidth}
\centering%
#1%
%figure,caption, and label go here
\end{minipage}\medskip}
\begin{document}
\SweaveOpts{concordance=TRUE}
\footnotesize
\maketitle
\tableofcontents
\newpage
\begin{multicols}{2}
% multicol parameters
% These lengths are set only within the two main columns
%\setlength{\columnseprule}{0.25pt}
\setlength{\premulticols}{1pt}
\setlength{\postmulticols}{1pt}
\setlength{\multicolsep}{1pt}
\setlength{\columnsep}{2pt}
\begin{center}
\normalsize{Multivariate Analysis Summary Sheet} \\
\footnotesize{
Compiled by: Shravan Vasishth ([email protected])\\
Version dated: \today}
\end{center}
<<echo=F>>=
options(width=60)
options(continue=" ")
@
\section{Preliminaries}
\subsection{Some basic results for computing covariance}
Let $\bar{x}_i$ be vectors:
\begin{equation}
var(\bar{x}_1 - \bar{x}_2) = var(\bar{x}_1) + var(\bar{x}_2) - 2 Cov(\bar{x}_1,\bar{x}_2)
\end{equation}
Note that if $\bar{x}_1$ and $\bar{x}_2$ come from independent samples, then $Cov(\bar{x}_1,\bar{x}_2)=0$ and the cross term drops out; the joint covariance matrix of the stacked vector is then the block-diagonal matrix
$$
\begin{pmatrix}
Cov(x_1) & 0 \\
0 & Cov(x_2)\\
\end{pmatrix}
$$
where $Cov(x_1)$ is the covariance matrix of the \textit{components} of $x_1$.
\subsection{Completing the square: complex numbers}
The roots of the quadratic $ax^2+bx+c=0$ are $\frac{-b \pm \sqrt{b^2 - 4ac}}{2a}$.
Often, imaginary numbers are involved. Recall that $i = \sqrt{-1}$ (so $i^2 = -1$). We will therefore often have roots with a real and a complex part:
$4 + \sqrt{-25}= 4 + 5\sqrt{-1}=4 + 5i$. In the time-series (TS) context, we need to know whether the roots lie outside the unit circle; with imaginary components, we can determine this by computing the modulus (length) of the root:
$\sqrt{4^2 + 5^2}=\sqrt{41}\approx 6.4$.
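A quick check in R, using the built-in complex type, for the root $4+5i$ from the example above:
<<>>=
## modulus (length) of the root 4 + 5i;
## the root lies outside the unit circle if the modulus exceeds 1:
z<-complex(real=4,imaginary=5)
Mod(z)
@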
\subsection{The model matrix}
We will treat $X'$ as the data matrix:
\begin{equation}
X'=\begin{pmatrix}
x_{11} & \dots & x_{1p}\\
\vdots & \vdots & \vdots \\
x_{n1} & \dots & x_{np}\\
\end{pmatrix}
\end{equation}
\subsection{Sample mean}
The \textbf{sample mean vector} is:
\begin{equation}
\bar{x} = \begin{pmatrix}
\bar{x}_1 \\
\vdots\\
\bar{x}_p\\
\end{pmatrix}
=
\frac{1}{n}
\begin{pmatrix}
x_{11} + \dots + x_{n1}\\
\vdots \\
x_{1p} + \dots + x_{np}\\
\end{pmatrix}
\end{equation}
More compactly: $\bar{x} = \frac{1}{n}X1 \Leftrightarrow \bar{x}' = \frac{1}{n}1'X'$.
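A small numerical check (toy data; here X is the $p\times n$ matrix, i.e., the transpose of the data matrix $X'$):
<<>>=
## sample mean vector as (1/n) X 1:
X<-t(cbind(c(1,3,5,7,9),c(4,7,8,11,15)))  ## X is p x n
n<-ncol(X)
(1/n) * X %*% rep(1,n)
rowMeans(X)  ## same values
@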
\subsection{Sample variance}
Sample variance (variance-covariance) matrix:
\begin{equation}
S_{p\times p} = var(X') = \frac{1}{n-1} (X-\bar{X}) (X-\bar{X})'
\end{equation}
$\bar{X}_{p\times n}=[\bar{x} \dots \bar{x}]$
\textbf{Properties of S}:
$S$ is symmetric; $S_{ii}$ is the sample variance of the $i$th variable, and $S_{ij}$ ($i\neq j$) is the sample covariance between the $i$th and $j$th variables.
If the columns of $S$ are linearly independent (i.e., if none of the variables is a linear combination of the others), then $S$ is non-singular and positive definite.
The \textbf{sample correlation matrix} R is the same as the vcov matrix, but has entries scaled:
\begin{equation}
r_{ij}=\frac{s_{ij}}{\sqrt{s_{ii}s_{jj}}}
\end{equation}
If $L$ is the diagonal matrix
$
\begin{pmatrix}
s_{11} & \dots & 0\\
\vdots & \ddots & \vdots \\
0 & \dots & s_{pp}\\
\end{pmatrix}
$
and $L^{1/2}$ has the standard deviations along the diagonal, then
\begin{equation}
S=L^{1/2} R L^{1/2}
\end{equation}
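A quick numerical check of this identity on toy data:
<<>>=
X<-cbind(c(1,3,5,7,9),c(4,7,8,11,15))
S<-var(X)
R<-cor(X)
Lhalf<-diag(sqrt(diag(S)))  ## standard deviations on the diagonal
Lhalf %*% R %*% Lhalf       ## recovers S
@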
\begin{enumerate}
\item R is symmetric, $p\times p$
\item $r_{ii}=1$ for all i
\item $-1 \leq r_{ij} \leq 1$.
\item Geometrically, $r_{ij}$ is the cosine of the angle between the vectors of deviations of observations of the ith and jth variables from the mean.
\begin{enumerate}
\item If angle is 0 deg (0 rad), $\rho=1$.
\item If angle is 90 deg ($\pi/2$ rad), $\rho=0$.
\end{enumerate}
(A small numerical illustration follows this list.)
\end{enumerate}
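A small numerical illustration on toy data: the sample correlation is the cosine of the angle between the deviation vectors.
<<>>=
x<-c(1,3,5,7,9)
y<-c(4,7,8,11,15)
dx<-x-mean(x); dy<-y-mean(y)
## cosine of the angle between the deviation vectors:
sum(dx*dy)/(sqrt(sum(dx^2))*sqrt(sum(dy^2)))
cor(x,y)  ## same value
@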
\section{Some useful results}
\begin{enumerate}
\item
Let $w$ be a vector. Then $var(X'w) = w'var(X')w=w'Sw$.
\item If $A$ is any $p\times q$ matrix, then $Var(X'A)=A'Var(X')A=A'SA$. (A numerical check follows this list.)
\end{enumerate}
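A quick numerical check of the first result, with toy data and an arbitrary weight vector $w$:
<<>>=
X<-cbind(c(1,3,5,7,9),c(4,7,8,11,15))  ## this plays the role of X' (n x p)
S<-var(X)
w<-c(2,-1)
var(X %*% w)      ## variance of the linear combination X'w
t(w) %*% S %*% w  ## same value
@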
\section{Useful properties of eigenvalues and eigenvectors}
Let $A_{p\times p}$, and let eigenvalues be $\lambda_i$.
\begin{enumerate}
\item $\sum \lambda_i = trace(A)$
\item $\prod \lambda_i = det(A)$
\item If $\lambda_i$ is an eigenvalue of A, then there is at least one vector $x_i$, called an eigenvector of A, such that $Ax_i = \lambda_i x_i$.
\item If A is real and symmetric $p\times p$, then there are always p linearly independent eigenvectors.
\item If C is any non-singular square matrix, then A and $CAC^{-1}$ have the same eigenvalues. If $x_i$ is an eigenvector of A with eigenvalue $\lambda_i$, then $Cx_i$ is an eigenvector of $CAC^{-1}$ with eigenvalue $\lambda_i$. [Prove it]
\item If A is real symmetric, i.e., $A=A'$, then the eigenvalues of A are real. [Prove it]
\item If A is real symmetric, and $\lambda$ and $\mu$ are distinct eigenvalues with corresponding eigenvectors x and y, then x and y are orthogonal. [Prove it]
\item If A is real symmetric, and if X denotes the matrix whose columns are normalized eigenvectors of A, then X is an orthogonal matrix ($XX'=I$, or $X'=X^{-1}$). [Prove it]
Also, $X^{-1}AX=\Lambda$, where $\Lambda$ is the diagonal matrix containing the eigenvalues of A along the diagonal. This is because $AX = X\Lambda$.
The decomposition $X^{-1}AX=\Lambda$ is called the \textbf{spectral decomposition}, and is equivalent to
\begin{equation}
A=\lambda_1 x_1 x_1' + \dots + \lambda_p x_p x_p'
\end{equation}
where $x_i$ is an eigenvector of A with eigenvalue $\lambda_i$.
\item
If A is any positive definite real symmetric matrix, it will have positive definite real symmetric square roots. [Prove it]
%If A is any positive definite real symmetric matrix, with orthogonal eigenvectors $x_1,\dots,x_p$, corresponding to eigenvalues $\lambda_1,\dots,\lambda_p$ (which are positive).
\item The eigenvalues of a variance-covariance matrix are non-negative. (A numerical check of several of these facts follows this list.)
\end{enumerate}
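A numerical check of the trace, determinant, and spectral-decomposition facts, using a toy covariance matrix:
<<>>=
A<-var(cbind(c(1,3,5,7,9),c(4,7,8,11,15)))
e<-eigen(A)
sum(e$values);  sum(diag(A))   ## sum of eigenvalues = trace
prod(e$values); det(A)         ## product of eigenvalues = determinant
## the spectral decomposition recovers A:
e$vectors %*% diag(e$values) %*% t(e$vectors)
@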
\section{Vector calculus review}
Some useful results:
\begin{enumerate}
\item
If S is a symmetric $p\times p$ matrix, and $f(x)=x'Sx$, then $\frac{\partial f}{\partial x} = 2Sx$. [Derive this.] (A finite-difference check follows this list.)
\item
If S is not symmetric, $\frac{\partial f(x)}{\partial x} = (S + S')x$.
\item
If S is I, then $\frac{\partial f(x)}{\partial x}=\frac{\partial x'x}{\partial x}=2x$.
\item If a is a constant p-vector, and $f(x)=a'x$, then $\frac{\partial f}{\partial x}=a$.
\end{enumerate}
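A finite-difference check of the first result (toy symmetric S and point x assumed):
<<>>=
S<-matrix(c(2,1,1,3),2,2)  ## symmetric
x<-c(1,2)
f<-function(x) as.numeric(t(x) %*% S %*% x)
eps<-1e-6
## numerical gradient via forward differences:
sapply(1:2,function(i){xi<-x; xi[i]<-xi[i]+eps; (f(xi)-f(x))/eps})
2 * S %*% x                ## analytical gradient 2Sx
@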
\section{Constrained optimization (Lagrange Multipliers)}
Suppose $t_1, \dots, t_n$ are unbiased estimates of $\theta$, and variance of $t_i$, $i=1,\dots,n$, is $\sigma_i^2$.
<<>>=
## example: ten unbiased estimates t_i of theta = 0,
## all with equal variance:
t<-rnorm(10)
@
Find \textbf{B}est \textbf{L}inear \textbf{U}nbiased \textbf{E}stimator of $\theta$.
Solution:
The BLUE of $\theta$ will be a weighted sum of the $t_i$. Let this weighted sum be $\tau = \sum a_i t_i$, such that $E[\tau]=\theta$.
We need to minimize the variance: $Var(\tau)=Var(\sum a_i t_i)=\sum a_i^2 Var(t_i)=\sum a_i^2\sigma_i^2$.
Since $E[t_i] = E[\tau]=\theta$, the weights $a_i$ must sum to 1. So the constraint is that $\sum a_i = 1$.
So, we minimize this function:
\begin{equation}
\Omega = \sum a_i^2\sigma_i^2 + \lambda (\sum a_i - 1)
\end{equation}
Differentiating with respect to each $a_i$, we get: $2 a_i\sigma_i^2 +\lambda = 0$, which implies that
\begin{equation}\label{ai}
a_i = -\frac{\lambda}{2 \sigma_i^2}
=
-\frac{1}{2}\frac{\lambda}{\sigma_i^2}
\end{equation}
Differentiating with respect to the Lagrange multiplier $\lambda$, we recover the constraint
\begin{equation}
\sum a_i - 1 = 0
\end{equation}
Replacing $a_i$ in the above with the RHS of equation~\ref{ai}:
\begin{equation}
\sum_i -\frac{1}{2}\frac{\lambda}{\sigma_i^2} - 1 = 0
\quad \Leftrightarrow \quad
-\frac{\lambda}{2} \sum_i \frac{1}{\sigma_i^2} = 1
\end{equation}
Solving for $\lambda$ (changing the summation index from i to k to avoid a clash below):
\begin{equation}
\lambda = \frac{-2}{\sum_k \frac{1}{\sigma_k^2}}
\end{equation}
Now we can find each $a_i$ by plugging this $\lambda$ into equation~\ref{ai}:
\begin{equation}
a_i = -\frac{1}{2}\frac{\lambda}{\sigma_i^2}
= \frac{1}{\sigma_i^2} \left[ \sum_k \frac{1}{\sigma_k^2}\right]^{-1}
\end{equation}
Finally, we plug the $a_i$ into $\tau=\sum a_i t_i$:
\begin{equation}
\tau = \sum_i a_i t_i = \sum_i \frac{t_i}{\sigma_i^2} \left[ \sum_k \frac{1}{\sigma_k^2}\right]^{-1}
\end{equation}
In other words, the BLUE weights each estimate by the inverse of its variance (inverse-variance weighting).
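A minimal numerical sketch of the result, with toy variances: the inverse-variance weights sum to 1, and the resulting estimator has a smaller variance than the unweighted mean.
<<>>=
sigma2<-c(1,4,9)             ## toy variances of t_1, t_2, t_3
a<-(1/sigma2)/sum(1/sigma2)  ## a_i = (1/sigma_i^2) / sum_k (1/sigma_k^2)
sum(a)                       ## weights sum to 1
sum(a^2*sigma2)              ## Var(tau) for the BLUE
sum((1/3)^2*sigma2)          ## Var of the unweighted mean: larger
@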
\section{Multivariate distributions}
Definition: If $\mu$ is a p-vector and $\Sigma$ is a positive definite symmetric $p\times p$ matrix, then MVN distribution $N_p(\mu,\Sigma)$ is:
\begin{equation}
f_x(x) = \frac{1}{(2\pi)^{p/2} \mid \Sigma \mid^{1/2} } \exp \left( -\frac{1}{2} (x-\mu)' \Sigma^{-1} (x-\mu) \right)
\end{equation}
\begin{enumerate}
\item
The quadratic form $(x-\mu)' \Sigma^{-1} (x-\mu)$ in the kernel is a statistical distance measure; for any value of x, the quadratic form gives the squared statistical distance of x from $\mu$, called squared Mahalanobis distance.
\item
Note that the MVN density is constant on surfaces of contours where
$(x-\mu)' \Sigma^{-1} (x-\mu)=c^2$
``The axes of each ellipsoid of constant density are in the direction of the eigenvectors of $\Sigma^{-1}$ (recall that these are the same as the eigenvectors of $\Sigma$, but if $\Sigma x=\lambda x$, then
$\Sigma^{-1} x=\lambda^{-1} x$), and their lengths are proportional to the reciprocals of the square roots of the eigenvalues of $\Sigma^{-1}$.'' (p.\ 95)
\item
If $x \sim N_p(\mu,\Sigma)$,
then
\begin{enumerate}
\item
$(x-\mu)' \Sigma^{-1} (x-\mu)\sim \chi_p^2$.
\item
The solid ellipsoid $\{x\mid (x-\mu)' \Sigma^{-1} (x-\mu) \leq \chi_p^2(\alpha)\}$ has probability $1-\alpha$.
\end{enumerate}
This follows from the fact that if $x\sim N_p(\mu,\Sigma)$ then
$y=\Sigma^{-1/2} (x-\mu)\sim N_p(0,I_p)$ and therefore:
\begin{equation}
y'y = (x-\mu)' \Sigma^{-1} (x-\mu) = \sum_{i=1}^p Y_i^2 \sim \chi_p^2
\end{equation}
(A small simulation check follows this list.)
``One of the consequences of the properties is that the marginal distributions of the individual variables of a multivariate normal distribution is a univariate normal distribution.'' (p.\ 96)
\item
If $X\sim N_p(\mu,\Sigma)$ and w is a p-vector, then the linear combination $w'X \sim N(w'\mu,w'\Sigma w)$.
\item
If $X\sim N_p(\mu,\Sigma)$ and A is a $q\times p$ matrix, then the linear combination $AX \sim N(A\mu,A\Sigma A')$.
\item
If $X\sim N_p(\mu_X,\Sigma_X)$ and $Y\sim N_q(\mu_Y,\Sigma_Y)$, then the p+q vector
$\begin{pmatrix}
X\\
Y
\end{pmatrix}
\sim N_{p+q}\left(
\begin{pmatrix}
\mu_X\\
\mu_Y
\end{pmatrix},
\begin{pmatrix}
\Sigma_X & 0 \\
0 & \Sigma_Y\\
\end{pmatrix}
\right)
$
as long as X and Y are independent.
\item
If
$\begin{pmatrix}
X\\
Y
\end{pmatrix}
\sim N_{p+q}\left(
\begin{pmatrix}
\mu_X\\
\mu_Y
\end{pmatrix},
\begin{pmatrix}
\Sigma_X & \Sigma \\
\Sigma' & \Sigma_Y\\
\end{pmatrix}
\right)
$
then X and Y are independent iff $\Sigma=0$.
\end{enumerate}
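A small simulation check of the chi-squared result above (a sketch with toy $\mu$ and $\Sigma$, using MASS::mvrnorm and the base function mahalanobis):
<<>>=
library(MASS)
set.seed(1)
p<-2
mu<-c(0,0)
Sigma<-matrix(c(2,1,1,3),2,2)
x<-mvrnorm(10000,mu=mu,Sigma=Sigma)
## squared Mahalanobis distances:
d2<-mahalanobis(x,center=mu,cov=Sigma)
## about 95% of the points fall inside the chi^2_p(0.05) ellipsoid:
mean(d2 <= qchisq(0.95,df=p))
@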
\section{Principal Components Analysis}
Useful mostly when we have continuous data.
[Sources: Tutorial on PCA by Shlens]
Let $X_{n\times p}$ be our data, where we have p different variables, and n measurements on each variable. Example:
<<>>=
n<-5
x1<-c(1,3,5,7,9)
x2<-c(4,7,8,11,15)
## n=5, p=2:
X<-data.frame(x1=x1,x2=x2)
## centered variables:
x1<-scale(x1,scale=F)
x2<-scale(x2,scale=F)
X<-data.frame(x1,x2)
X<-as.matrix(X)
@
The matrix $\frac{1}{n-1}X'X$ (where X holds the column-centered data, as above) is the real symmetric variance-covariance matrix, and represents the relationships between the variables:
<<print=TRUE>>=
var1<-(1/(n-1))*t(X)%*%X
var2<-var(X)
@
The total amount of dispersion on the data is the sum of the variances (the trace, which is the sum of the eigenvalues of the vcov matrix):
<<>>=
## sum of the variances=the total dispersion:
sum(diag(var2))
## sum of the eigenvalues:
sum(eigen(var2)$values)
@
Note that \textbf{the eigenvalues of the covariance matrix encode the variability of the data in an orthogonal basis that captures as much of the data's variability as possible in the first few basis vectors (i.e., the principal component basis)}.
\begin{fact}
Any symmetric variance-covariance matrix X is
diagonalized by an orthogonal matrix of its eigenvectors (see Theorems 1 and 2 in MatrixAlgebraSummary.pdf).
For a symmetric matrix X, $X = EDE^T$,
where D is a diagonal matrix and E is a matrix of eigenvectors of X arranged as columns.
\end{fact}
\textbf{PCA using eigenvalue decomposition}:
\textit{Eigenvalues of covariance matrices can be interpreted as variances of particular linear combinations of the component random variables. In particular, $\lambda_1$ is the largest variance obtainable by taking linear combinations of X with weights having unit length.} Thisted, p.\ 120.
The goal is to find some orthonormal matrix P in $Y = PX$ such that the covariance matrix $C_Y = \frac{1}{n} Y Y^T$ is a diagonal matrix. The rows of P are the \textit{principal components} of X.
First, write $C_Y$ in terms of the unknown Y:
\begin{equation}
\begin{split}
C_Y =& \frac{1}{n} Y Y^T\\
=& \frac{1}{n} (PX)(PX)^T\\
=& \frac{1}{n} PXX^T P^T\\
=& P (\frac{1}{n} XX^T) P^T\\
C_Y =& PC_X P^T \\
\end{split}
\end{equation}
Choose P to be a matrix such that each row $p_i$ is an eigenvector of $C_X$. So $P=E^T$, where E has the eigenvectors of $C_X$ as its columns. Also note that $P^{-1} = P^T$.
Next, we show that this choice of P diagonalizes $C_Y$---that's the goal of PCA.
\begin{equation}
\begin{split}
C_Y =& PC_X P^T\\
=& P(EDE^T)P^T\\
=& P(P^TDP)P^T\\
=& (PP^T) D (PP^T)\\
=& (PP^{-1}) D (PP^{-1})\\
C_Y =& D
\end{split}
\end{equation}
Key point: the $i$-th diagonal value of $C_Y$ is the variance of X along the principal component (an eigenvector) $p_i$.
The real symmetric var-cov matrix var2 can be decomposed into $U\lambda U'$, where
\begin{enumerate}
\item
U is the orthonormal matrix containing the eigenvectors of var2
\item
$\lambda$ is a diagonal matrix containing the eigenvalues of var2
\end{enumerate}
<<>>=
(lambda<-diag(eigen(var2)$values))
U<-eigen(var2)$vectors
## recovers original vcov matrix:
U%*%lambda%*%t(U)
@
Aside: you could do another PCA on the diagonal matrix lambda. The eigenvalues would remain unchanged, and the eigenvectors would simply be the standard basis vectors:
<<>>=
eigen(lambda)$values
eigen(lambda)$vectors
@
We want to linearly transform the vectors
x1 and x2:
\begin{center}
\includegraphics[height=5cm,width=7cm]{bestfitPC}
\end{center}
In this figure, the largest direction of variance lies along the best fit line, not perpendicular to the best fit line. Thus, by assumption, the dynamics of interest lie along the direction with largest variance. Maximizing the variance corresponds to finding the appropriate rotation of the naive basis.
\textbf{By transforming $X'$ to $Y'=aX'$, we have projected $X'$ onto $a$, a one-dimensional space, a single line. The values of $Y'$ give the co-ordinates of each observation along the vector a}.
For example, if we have
$x_1 =
\begin{pmatrix}
1\\
2
\end{pmatrix},
x_2 =
\begin{pmatrix}
2\\
1
\end{pmatrix},
x_3 =
\begin{pmatrix}
-1\\
1
\end{pmatrix}$, and
$a =
\begin{pmatrix}
\frac{3}{5}\\
\frac{-4}{5}
\end{pmatrix}$, then
$X' =
\begin{pmatrix}
1 & 2\\
2 & 1\\
-1 & 1
\end{pmatrix}$.
Therefore $Y' = X'a =
\begin{pmatrix}
-1 \\
\frac{2}{5}\\
-\frac{7}{5}
\end{pmatrix}$.
So, $x_1 = (-1)\, a + z_1$, where $z_1$ is orthogonal to $a$; and similarly for $x_2$ and $x_3$.
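A quick numerical check of this example:
<<>>=
Xp<-matrix(c(1,2,
             2,1,
            -1,1),ncol=2,byrow=TRUE)  ## X'
a<-c(3/5,-4/5)
(Yp<-Xp %*% a)     ## coordinates along a: -1, 2/5, -7/5
## the residual of x_1 is orthogonal to a:
z1<-Xp[1,] - Yp[1]*a
sum(z1*a)          ## 0
@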
<<echo=FALSE>>=
op<-par(mfrow=c(1,2),pty="s")
plot(x1,x2,xlim=c(-6,6),ylim=c(-6,6),
main="")
abline(lm(x2~x1))
## First PC is always along the line
## of best fit:
arrows(x0=0,y0=0,x1=5*U[1,1],y1=5*U[2,1],lwd=2)
arrows(x0=0,y0=0,x1=5*U[1,2],y1=5*U[2,2],lty=2)
#legend(x=-5,y=20,lty=c(1,2),legend=c("PC1","PC2"))
library(MVA)
bvbox(X)
arrows(x0=0,y0=0,x1=5*U[1,1],y1=5*U[2,1],lwd=2)
arrows(x0=0,y0=0,x1=5*U[1,2],y1=5*U[2,2],lty=2)
@
If the maximum variance were \textit{not}
along the line of best fit, but in a perpendicular direction, the first PC would point in that direction:
<<echo=FALSE>>=
x1<-c(1,2,3,3,2)
x2<-c(20,55,4,30,5)
x1<-scale(x1,scale=F)
x2<-scale(x2,scale=F)
op<-par(mfrow=c(1,2),pty="s")
plot(x1,x2,xlim=c(-33,33),ylim=c(-33,33))
abline(lm(x2~x1))
X<-data.frame(x1,x2)
X<-as.matrix(X)
lambda<-eigen(var(X))$values
U<-eigen(var(X))$vectors
arrows(x0=0,y0=0,x1=20*U[1,1],y1=20*U[2,1])
arrows(x0=0,y0=0,x1=20*U[1,2],y1=20*U[2,2],lty=2)
bvbox(X)
arrows(x0=0,y0=0,x1=20*U[1,1],y1=20*U[2,1],lwd=2)
arrows(x0=0,y0=0,x1=20*U[1,2],y1=20*U[2,2],lty=2)
@
\begin{center}
\includegraphics[height=5cm,width=7cm]{PC2}
\end{center}
Assumption (sometimes incorrect): \textbf{Large variances have important structure}.
Therefore maximize variance to find the most important PCs.
The reasoning is that if variance is small, then all the observations have similar values, so there is little information in the data. If variance is large, then we have more ``information''.
\subsection{Geometric interpretation}
Basically, the PCA method just rotates the data so that the ellipsoid's axes are the principal components.
\subsection{Using unbiased or MLE variance formula}
The variance matrix used by prcomp (unbiased, divisor $n-1$) is $c = \frac{n}{n-1}$ times the variance matrix
used by princomp (MLE, divisor $n$).
Scaling a matrix S by a factor c to a matrix cS scales the eigenvalues
by a factor of c also; the eigenvectors remain unchanged, however. Since the eigenvalues are all scaled equally, the scree graph would be simply scaled in the vertical direction by a factor of c; the relative contributions of the eigenvalues will be unchanged.
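A quick check of this scaling relationship on toy data:
<<>>=
X<-cbind(x1=c(1,3,5,7,9),x2=c(4,7,8,11,15))
n<-nrow(X)
p1<-prcomp(X)    ## divisor n-1
p2<-princomp(X)  ## divisor n
p1$sdev^2/p2$sdev^2  ## each ratio is n/(n-1) = 1.25
@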
\subsection{Presentation from lecture notes}
\textbf{Goal}: reduce $X_{n\times p}$ to $Y_{n\times q}$
``A linear transformation $X' \rightarrow Y'$ is given by $Y'=X'A$ where A is a $p\times p$-matrix; it makes statistical sense to restrict attention to non-singular matrices A. If A happens to be an orthogonal matrix, i.e., $A'A=I_p$, then the transformation $X'\rightarrow Y'$ is an orthogonal transformation (i.e., just a rotation and/or a reflection of the n points in p-dimensional space)''.
We want to choose A such that variance $X'A$ is maximized; this is because ``maximizing the variance corresponds to finding the appropriate rotation of the naive basis'' (Shlens tutorial).
How to choose A such that variance of $X'A$ is maximized? If we just choose some A:
<<>>=
m<-matrix(rnorm(1000),byrow=TRUE,ncol=10)
diag(var(m))[1:3]
a<-matrix(rnorm(100),byrow=TRUE,ncol=10)
@
You can increase the variance arbitrarily
by multiplying a by some constant, here 2:
<<>>=
head(round(diag(m1<-var(m%*%a)),digits=2))
head(round(diag(m2<-var(m%*%(2*a))),
digits=2))
@
Notice that variances have increased by
a factor of $2^2$:
<<>>=
diag(m2)/diag(m1)
@
\textbf{The point here}: The ``maximum variance'' is unbounded; we can increase it arbitrarily. To make the maximization meaningful, we must constrain the weights, and that is why the constraints below are imposed: each $w_i$ has unit length ($w_i' w_i = 1$), and the eigenvectors $w_i, w_j$ ($i\neq j$) are orthogonal.
\begin{theorem}
The $p$ principal components of data $X'$ are the $p$ eigenvectors $a_1,\dots,a_p$ corresponding to the $p$ ordered eigenvalues $\lambda_1 \geq \dots \geq \lambda_p$ of S, the variance of $X'$.
[Theorem 3.3 in lecture notes]
\end{theorem}
\begin{definition}
The first principal component is the vector $a_1$ such that the projection of the data $X'$ onto $a_1$, i.e., $X'a_1$, has maximal variance, subject to the normalising constraint that $a_1'a_1 = 1$ (i.e., $a_1$ has length 1).
\end{definition}
\subsection{PCA example by hand}
To find the first PC, we maximize $w_1' S w_1$ subject to $w_1' w_1=1$, where $S$ is the variance matrix of $X'$ (proportional to $X'X$ for column-centered data). $w_1$ will be the eigenvector representing the direction of maximum variance.
\begin{equation}
w_1' S w_1 - \lambda_1 (w_1' w_1 -1)
\end{equation}
Differentiating with respect to $w_1$ gives:
\begin{equation}
S w_1 - \lambda_1 w_1 = 0 \Leftrightarrow
S w_1 = \lambda_1 w_1
\end{equation}
But this means that $\lambda_1$ is an eigenvalue of S, and that $w_1$ is the corresponding eigenvector.
When finding the second principal component, we want to ensure that the projection onto $w_2$ is uncorrelated with the projection onto $w_1$. This means that $w_2' S w_1=0$; but since $S w_1 = \lambda_1 w_1$, this amounts to $w_2' w_1 = 0$.
So the second PC $w_2$ should have the property that the projection of $X'$ onto $w_2$ should have maximum variance subject to $w_2'w_2=1$ and
$w_2' w_1 = 0$. Since we have two constraints we will use two Lagrangian multipliers:
\begin{equation}
\Omega_2 = w_2' S w_2 -\mu w_2' w_1 - \lambda_2(w_2' w_2 -1 )
\end{equation}
Differentiate with respect to $w_2$:
\begin{equation} \label{e0}
2S w_2 -\mu w_1 - 2\lambda_2 w_2=0
\end{equation}
Pre-multiplying each side with $w_1'$:
\begin{equation}\label{e1}
2w_1'S w_2 -\mu w_1'w_1 - 2\lambda_2 w_1'w_2=0
\end{equation}
Pre-multiplying equation~\ref{e0} by $w_2'$:
\begin{equation}\label{e3}
2w_2'S w_2 -\mu w_2'w_1 - 2\lambda_2 w_2'w_2=0
\end{equation}
Because $w_1'w_1=1$ and $w_1'w_2=0$ (the constraints), equation~\ref{e1} gives
\begin{equation}
\mu = 2w_1'S w_2 = 2(Sw_1)' w_2 = 2(\lambda_1 w_1)' w_2 = 2\lambda_1 w_1' w_2=0
\end{equation}
From equation~\ref{e0}, we can then conclude that
\begin{equation}
S w_2 = \lambda_2 w_2
\end{equation}
Therefore, $w_2$ is an eigenvector of S, with eigenvalue $\lambda_2$.
Recall that $var(X'w_2) = w_2'var(X')w_2 = w_2'Sw_2$. Equation~\ref{e3} states that:
\begin{equation}
2w_2'S w_2 -\mu w_2'w_1 - 2\lambda_2 w_2'w_2=0
\end{equation}
Replacing $w_2'S w_2$ with $var(X'w_2)$, and using $\mu=0$ and $w_2'w_2=1$:
\begin{equation}
2var(X'w_2) -\explain{\mu w_2'w_1}{=0} - 2\lambda_2 \explain{w_2'w_2}{=1}=0
\quad\Leftrightarrow\quad
var(X'w_2) = \lambda_2
\end{equation}
Therefore, $var(X'w_2) = \lambda_2$. This entails that $\lambda_2$ must be the second largest eigenvalue, with eigenvector $w_2$: we are maximizing $var(X'w_2)$, so we take the largest available eigenvalue, and $\lambda_1$ is already taken by $w_1$, to which $w_2$ must be orthogonal.
All Principal Components can be found using the diagonalization of S. The eigenvalue decomposition of S is also called the \textbf{spectral decomposition} of S. The set of eigenvalues is called the \textbf{spectrum} of S.
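A numerical check that the variances of the projections onto the eigenvectors equal the eigenvalues (toy data from above):
<<>>=
X<-cbind(c(1,3,5,7,9),c(4,7,8,11,15))
X<-scale(X,scale=FALSE)         ## center
S<-var(X)
e<-eigen(S)
apply(X %*% e$vectors, 2, var)  ## variances of the projections
e$values                        ## the eigenvalues: same values
@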
\subsection{Computing PCs using eigen()}
<<>>=
x<-c(1,3,5,7,9)
y<-c(4,7,8,11,15)
plot(x,y)
xy<-data.frame(x=x,y=y)
var_matrix<-var(xy)
lambda<-eigen(var_matrix)$values
eigenvectors<-eigen(var_matrix)$vectors
## first principal component
## contains 99.2% of
## the information:
lambda[1]/(lambda[1]+lambda[2])
## using prebuilt function:
xy_pc<-princomp(xy)
plot(xy_pc)
#source("code/scree.R")
@
\subsection{How to map the points onto the PC}
Take the first example above:
<<>>=
n<-5
x1<-c(1,3,5,7,9)
x2<-c(4,7,8,11,15)
## n=5, p=2:
## centered variables:
x1<-scale(x1,scale=F)
x2<-scale(x2,scale=F)
X<-data.frame(x1,x2)
X<-as.matrix(X)
S<-var(X)
lambda<-eigen(S)$values
U<-eigen(S)$vectors
@
The first data point is $x_1=(1,4)'$ (before centering), and $e_1$ is the first PC.
\begin{equation}
x_1 = \bar{x} + te_1 + ue_2
\end{equation}
$\bar{x}=0$ as data are centered, so we have:
\begin{equation}
x_1 = te_1 + ue_2
\end{equation}
Premultiplying by $e_1'$ (and using $e_1'e_1=1$, $e_1'e_2=0$):
\begin{equation}
e_1' x_1 = t\, e_1' e_1 + u\, e_1' e_2 = t
\end{equation}
So $t= e_1' x_1$ (\textbf{or}, equivalently, $t= x_1' e_1$), and
$t e_1$ is the point onto which
$x_1$ is mapped on the first PC.
<<>>=
x1<-c(1,3,5,7,9)
x2<-c(4,7,8,11,15)
## n=5, p=2:
X<-data.frame(x1=x1,x2=x2)
## centered variables:
x1<-scale(x1,scale=F)
x2<-scale(x2,scale=F)
X<-data.frame(x1,x2)
(X<-as.matrix(X))
(S<-var(X))
(lambda<-eigen(S)$values)
(U<-eigen(S)$vectors)
@
<<fig=FALSE>>=
## plot centered values
plot(x1,x2,xlim=c(-6,6),ylim=c(-6,6),
main="")
## fit lm line:
abline(lm(x2~x1))
## First PC is always along the line
## of largest variance.
## Use the eigenvectors to draw new axes:
U
## first axis, first PC:
arrows(x0=0,y0=0,x1=5*U[1,1],y1=5*U[2,1],lwd=2)
## second axis, second PC:
arrows(x0=0,y0=0,x1=5*U[1,2],y1=5*U[2,2],lty=2)
## the original first point (centered):
arrows(x0=0,y0=0,x1=x1[1],y1=x2[1],lty=1,lwd=2)
## multiply first PC with first point's coords:
t<-t(U[,1])%*%X[1,]
## the first point is mapped to U1:
U1<-t*U[,1]
## the mapping of the first point to the first PC:
arrows(x0=0,y0=0,x1=U1[1],y1=U1[2],lty=1,lwd=3)
@
\includegraphics[height=5cm,width=7cm]{PCmapping}
\subsection{Computing PCs using the correlation matrix}
Needed when data are of mixed type, or one variable has very high variance compared to others.
\textbf{Rule of thumb}:
``If the largest variance is more than about 4 times the smallest, then use the correlation matrix (otherwise the variables with large variance will dominate the principal component calculation).''
One should use the vcov matrix if
the scales of measurement are similar, and the standard deviations of all variables are similar.
\textbf{Note}: The sort of interpretation given in the turtle example is really only valid when the variance matrix is used rather than the correlation matrix.
Interpretation of loadings is only viable for small numbers of variables.
Steps:
\begin{enumerate}
\item Center the data, compute the covariance matrix, and convert it to a correlation matrix (a check using prcomp follows this list):
<<>>=
x1<-c(1,2,4,4,5,6,8,10)
x2<-c(2,3,6,7,8,11,13,14)
x1<-scale(x1,scale=F)
x2<-scale(x2,scale=F)
X<-data.frame(x1,x2)
X<-as.matrix(X)
S<-var(X)
lambda<-eigen(S)$values
U<-eigen(S)$vectors
S_corr<-cov2cor(S)
@
\end{enumerate}
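As a check (a small sketch), the built-in functions can work from the correlation matrix directly, via scale.=TRUE in prcomp (or cor=TRUE in princomp):
<<>>=
pc_cor<-prcomp(X,scale.=TRUE)
pc_cor$sdev^2         ## eigenvalues of the correlation matrix
eigen(S_corr)$values  ## same values
@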
\subsection{Quadratic PCA}
With a very small number of variables, one might try to generalise to, e.g., quadratic principal components by adding a variable for each quadratic combination.
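One simple way to do this (a sketch): augment the data matrix with squared and cross-product terms, then run an ordinary PCA on the augmented (scaled) matrix.
<<>>=
x1<-c(1,3,5,7,9)
x2<-c(4,7,8,11,15)
Xq<-cbind(x1,x2,x1^2,x2^2,x1*x2)
prcomp(Xq,scale.=TRUE)$sdev^2
@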
\subsection{Other things to look up}
\begin{enumerate}
\item Projection pursuit
\item ICA
\item Factor analysis
\item Kohonen's SOMs
\item Generative topographic mapping
\end{enumerate}
\section{Multidimensional scaling}
Goal: visualize a proximity matrix, if possible with a good lower-dimensional approximation (similar to PCA).
\textbf{The mathematical problem}: $n\times n$ proximity matrix (a symmetric matrix of $\delta_{ij}$ dissimilarities), find a q-dimensional space such that the calculated distance matrix $d_{ij}$ reasonably matches the given dissimilarity matrix $\delta_{ij}$. We can generally find one for $q=n-1$. The more interesting case is when q is very small.
Aims of MDS:
\begin{enumerate}
\item To learn about the measure of dissimilarity itself
\item To discover underlying structure in the data
\item
To see whether the data naturally divides into groups (clustering)
\end{enumerate}
\subsection{Principal coordinate analysis (classical metric scaling)}
Principal coordinate analysis = classical (metric) MDS; when the dissimilarities are Euclidean distances, it recovers the principal component scores.
\begin{theorem}
D is a matrix of Euclidean distances iff B is positive semidefinite (i.e., all eigenvalues of B are non-negative).\hfill
[Theorem 4.2, p.\ 52]
\end{theorem}
Given a distance matrix $D=(\delta_{ij})$, we find the configuration of points:
\begin{enumerate}
\item
Find $B=HAH$, where $A=(a_{ij})$ with $a_{ij}=-\frac{1}{2} \delta_{ij}^2$, and $H=I_n - \frac{1}{n} J_n$ is the centering matrix ($J_n$ has all entries equal to 1).
\item Find the eigenanalysis of B: eigenvalues $\lambda_1 \geq \dots \geq \lambda_n$ and normalized eigenvectors $v_1,\dots,v_n$.
\item Rescale each eigenvector so that $v_i'v_i = \lambda_i$, i.e., multiply the normalized $v_i$ by $\sqrt{\lambda_i}$.
\item The rows of the matrix with these rescaled eigenvectors as columns are the principal coordinates of the points.
\end{enumerate}
\textbf{Example}:
<<>>=
x<-c(1,3,5,7,9)
y<-c(4,7,8,11,15)
D<-matrix(rep(NA,25),ncol=5)
## compute squared Euclidean distances:
for(i in 1:5){
for(j in 1:5){
d<-(x[i]-x[j])^2+(y[i]-y[j])^2
D[i,j]<-d
}
}
A<- -0.5 * D
H<-diag(5) - (1/5)*matrix(rep(1,25),ncol=5)