Machine Learning
Theorical Background
- Types of Machine Learning
- Supervised Learning
- Unsupervised Learning
- Reinforcement Learning
- Self Learning
- Feature Learning
- Sparse Dictionary Learning
- Anomaly Detection
- Association Rules
- Two Supervised Learning Methods
- Regression
- Classification
- Empirical Risk Minimization
- Loss and Risk Functions
- Algorithms
Supervised Learning
A problem of inferring, estimating or learning a function $f$ for explaining $y$ based on $f(x)$
- $y\in \mathcal{Y}$: Output (Dependent, Response Variable or Label)
- $x\in \mathcal{X}$: Input (Independent, Predictive Variable of Feature)
- $f\in \mathcal{F}$: $\mathcal{X}\rightarrow \mathcal{Y}$: Model (Regression Function or Hypothesis, Machine)
c.f., Unsupervised Learning
Regression & Classification
- Regression
- $\mathcal{Y}$: Infinite and Ordered ($\mathcal{Y}=\mathbb{R},\mathbb{R}^+,\ \mathbb{N}\cup\lbrace0\rbrace,\ …$)
- Linear Regression: $\mathcal{Y}=\mathbb{R}$
- Poisson Regression: $\mathcal{Y}=\mathbb{N}\cup\lbrace0\rbrace$
- Classification
- $\mathcal{Y}$: Finite and Unordered ($\mathcal{Y}=\lbrace0,\ 1\rbrace,\ \lbrace-1,\ 1\rbrace,\ \lbrace1,\ …,\ k\rbrace,\ …$)
- Logistic Regression: $\mathcal{Y}=\lbrace0,\ 1\rbrace$
- Support Vector Machine: $\mathcal{Y}=\lbrace-1,\ 1\rbrace$
- Multinomial Regression: $\mathcal{Y}=\lbrace1,\ …,\ k\rbrace$
- Prediction Based on $x$
- Regression: $y_x=f(x)$
- Classification: $y_x=\phi(x)$
- $\phi(x)$: Classifier as a Function of $f(x)$
- e.g., $\phi(x)=sign(f(x))$
Loss, Risk and Empirical Risk Minimization
- Loss (Cost) Function: $L(y,\ f(x))$
- It measures
loss
orcost
for the use of $f$ - Regression: $y\in \mathbb{R}$
- Square: $(y-f(x))^2$
- Quantile: $\rho_\tau(y-f(x)),\ 0<\tau<1$
- Huber: $H_\delta(y-f(x)),\ \delta>0$
- Classification: $y\in {-1,\ 1}$
- Zero-One: $I[y\neq\phi]=I[yf(x)<0]$
- Logistic: $log\lbrace1+exp(-yf(x))\rbrace$
- Exponential: $exp(-yf(x))$
- Hinge: $\max\lbrace0,\ 1-yf(x)\rbrace$
- $\psi$: $I[yf(x)\leq0]+\max\lbrace0,\ 1-yf(x)\rbrace I[yf(x)>0]$
- It measures
- Risk Function: $R(f)=E_{(y,\ x)}L(y,\ f(x))$
- Long-Term average of losses or costs
- Square Loss: $E(y-f(x))^2$
- Zero-One Loss: $EI[y\neq \phi(x)]=P[y\neq \phi(x)]$
- Condition Risk: $R(f)=E_{y|x}[L(y,\ f(x))|x]$
- Long-Term average of losses or costs
- Learning $f\in\mathcal{F}$
- Learning (Estimating, Training) an optimal
$f\in\mathcal{F}$ or $R(f)=E_x\lbrace E_{y|x}L(y,\ f(x))|x\rbrace\geq R(\hat f),\ \forall f\in\mathcal{F}$ - Example)
- Square loss with $f(x)=x^T\beta$
- $\hat\beta=E(xx^T)^{-1}E(xy)$
- Zero-one loss with $\phi(x)$
- $\hat\phi(x)=arg\max_kP(y=k|x)$
- $\hat\phi$: Bayes Classifier
- $R(\hat\phi)$: Bayes Risk
- $\hat\phi(x)=arg\max_kP(y=k|x)$
- Square loss with $f(x)=x^T\beta$
- Learning (Estimating, Training) an optimal
- Empirical Risk Minimization:
$\hat f=arg\min_{f\in \mathcal{F}}R_n(f)$ - Empirical Risk Function:
$R_n(f)=\Sigma^n_{i=1}L(y_i,\ f(x_i))/n$- Training Sample ($y_i,\ x_i$), $i\leq n$:
Independent samples / observations of $(y,\ x)$ - Linear Regression with Square Loss:
$\hat \beta =arg\min_\beta\Sigma^n_{i=1}(y_i-x^T_i\beta)^2/n$ - Logistic Regression with Logistic Loss:
$\hat\beta=arg\min_\beta\Sigma^n_{i=1}\lbrace-y_ix^T_i\beta+log(1+exp(x^T_i\beta))\rbrace/n$
- Training Sample ($y_i,\ x_i$), $i\leq n$:
Iterative Algorithms
- Univariate Function Minimization
- Iterative Grid Search Algorithm
- Golden Section Algorithm
- Multivariate Function Minimization
- Coordinate Descent Algorithm
- Gradient Descent Algorithm
Univariate Function Minimization
- Problem: find $x^*=arg\min_{x\in[a,\ b]}f(x)$
- Iterative Algorithm: for $n=1,\ 2,\ …$
- $x^*\in[a_{n+1},\ b_{n+1}]\subset[a_n,\ b_n]$
- (Lemma) $f$: convex on $[a_n,\ b_n]$
- $
\begin{equation}
\left.
\begin{gathered}x^*\in[a_n,\ b_n]\newline \exists_C\in[a_{n+1},\ b_{n+1}]\subset[a_n,\ n_n]\newline s.t.\ \ f(a_{n+1})>f(c)<f(b_{n+1})
\end{gathered}
\right\rbrace\rightarrow x^\ast [a_{n+1},\ b_{n+1}] \notag
\end{equation}
$
- $
Grid Search
- Non-Iterative Grid Search Alhorithm: $x^*\approx arg\min_{z\in\lbrace z_k=a+k(b-a)/m|k=0,\ …,\ m\rbrace}f(z)$
- Iterative Grid Search Algorithm: for $n=1,\ 2,\ …$
- $
\begin{equation}
\left\lbrace
\begin{gathered}
a_{n+1}=a_n+(k_n-1)(b_n-a_n)/m \newline
b_{n+1}=a_n+(k_n+1)(b_n-a_n)/m
\end{gathered}
\right. \notag
\end{equation}
$ - $k_n=arg\min_{k\in\lbrace0,\ …,\ m\rbrace}f(a_n+k(b_n-a_n)/m)$
- $
Golden Section Algorithm
- Golden Section Algorithm: for $n=1,\ 2,\ …$
- Two inner points
- $
\begin{equation}
\left.
\begin{gathered}
c_n=\gamma a_n+(1-\gamma)b_n \newline
d_n=\gamma b_n+(1-\gamma)a_n
\end{gathered}
\right. \notag
\end{equation}
$ - Golden Ratio: $\gamma=(\sqrt{5}-1)/2\approx0.618034$
- $[c_n,\ d_n]\subset[a_n,\ b_n]$
- $
- Update Rule
- $
\begin{equation}
\left.
\begin{gathered}
f(c_n)>f(d_n)\rightarrow[a_{n+1},\ b_{n+1}]=[c_n,\ b_n] \newline
f(c_n)<f(d_n)\rightarrow[a_{n+1},\ b_{n+1}]=[a_n,\ d_n]
\end{gathered}
\right. \notag
\end{equation}
$
- $
- Two inner points
Multivariate Function Minimization
- Problem: $x^*=arg\min_{x\in\mathbb{R}^p}f(x)$
- Iterative Algorithm: for $n=1,\ 2,\ …$
- $x_{n+1}=x_n+\alpha_nd_n,\ \alpha_n>0,\ f(x_{n+1})<f(x_n)$
- $\alpha_n\in\mathbb{R}$: Step Size or Learning Rate
- $d_n\in\mathbb{R}^p$: Descent Direction Vector
- How to find $\alpha_n$ and $d_n$?
- Find a descent direction $d_n$
- Find a step size $\alpha_n$
- $a_n=arg\min_{t>0}g_n(t),\ g_n(t)=f(x_n+td_n)$
Descent Condtion for Direction Vector
- Direction Opposite to Current Gradient
- Let $x_\alpha=x+\alpha d$, then by the first order Taylor’s expansion around $x$, we have $f(x_\alpha)$
- $\begin{equation}\begin{aligned}f(x_\alpha)&=f(x)+\nabla f(x)^T(x_\alpha-x)+o(||x_\alpha-x||)\newline &=f(x)+\alpha\nabla f(x)^Td+o(\alpha||d||)\newline &=f(x)+\alpha\nabla f(x)^Td+o(\alpha)\end{aligned}\notag\end{equation}$
- $\nabla f(x)=\partial f(x)/\partial x_j,\ j\leq p$: Gradient vector of $f$ at $x$
- (Descent Condition) So we have, $f(x_\alpha)-f(x)<\alpha\nabla f(x)^Td<0$, for all sufficiently small $\alpha>0$, if $\nabla f(x)^Td<0$
- Let $x_\alpha=x+\alpha d$, then by the first order Taylor’s expansion around $x$, we have $f(x_\alpha)$
Gradient Descent Algorithm
- General Form with Negative Gradient Direction
- $x_{n+1}=x_n+\alpha_nd_n=x_n-\alpha_nD_n\nabla f(x_n)$
- $D_n$: Strictly Positive Definite Matrix
- $\nabla f(x_n)^TD_n\nabla f(x_n)>0$
- $x_{n+1}=x_n+\alpha_nd_n=x_n-\alpha_nD_n\nabla f(x_n)$
- Various Directions Vectors
- Steepest Descent: $D_n=I_p\rightarrow x_{n+1}=x_n-\alpha_n\nabla f(x_n)$
- Newton-Raphson: $D_n=\nabla^2f(x_n)^{-1},\ \alpha_n=1$
- Modified Newton’s Method: $D_n=\nabla^2f(x_0)$ for some $x_0$
- Diagonally Scaled Steepest Descent: $D_n=[diag(\nabla^2f(x_n))]^{-1}$
Step Size of Gradient Descent Algorithm
- Limited Minimization Rule
- $f(x_n+\alpha_nd_n)=\min_{\alpha\in[0,\ s]}f(x_n+\alpha d_n)$, for some $s>0$
- Successive Step Size Reduction (Armijo Rule, Armijo [1996])
- Find the first integer $m\geq 0$
that satisfies the stopping rule $f(x_n+r^msd_n)-f(x_n)\leq\sigma r^ms\nabla f(x_n)^Td_n$,
for some constant $s>0,\ 0<r<1$ and $0<\sigma<1$ - Practical Choice: Try $\alpha_n=1,\ 1/2,\ 1/4,\ …$
- Find the first integer $m\geq 0$
- Deterministic Step Size: Small Constant (0.1, 0.01, 0.001)
- A Sequence: Square summable but not summable ($\propto1/n$), nonsummable diminishing ($\propto1/\sqrt{n}$), …
Coordinate Descent Algorithm
- Successive Minimization of Univariate Functions
- Iterative Algorithm: for $n=1,\ 2,\ …$ repeat inner cycling below
- Inner Cycling: for $j=1,\ 2,\ …,\ p$
- $x_{(n+1)j}arg\min_{t\in\mathbb{R}}f(x_{(n+1)1},\ …,\ x_{(n+1)(j-1)},\ t,\ x_{n(j+1)},\ …,\ x_{np}$
- Direction vector for the $j$th inner cycling step
- $
\begin{equation}
d_{nk}=
\left\lbrace
\begin{gathered}-sign(\partial f(\tilde x_{nj})/\partial x_j),\ &k=j\newline 0,\ &k\neq j
\end{gathered}
\right. \notag
\end{equation}
$ - $\tilde x_{nj}=(x_{(n+1)1},\ …,\ x_{(n+1)(j=1)},\ x_{nj},\ x_{n(j+1)},\ …,\ x_{np})$
- $
- Inner Cycling: for $j=1,\ 2,\ …,\ p$
- Inner Cycling Minimization
- Closed Form vs. Univariate Minimization
- Invariance to the order of coordinates
- Individual Corrdinate vs. A Block of Coordinates
one-at-a-time
notall-at-once
- Convergence Cases
- $f$: Convex and Differentiable $\rightarrow$ Always converges to a local minimizer
- $f(x,\ y)=x^2+y^2+x+10y$
- $f$: Convex but Not Differentiable $\rightarrow$ May stuck in a point that is not a local minimizer
- $f(x,\ y)=(x-y)^2+(x+y)^2+|x-y|+20|x+y|$
- CD algorithm always converges if $f(x)=g(x)+\Sigma^p_{j=1}h_j(x_j)$, where $g$ is convex differentiable and $h_j$ is convex but may not differentiable for all $j\leq p$ [Tseng, 2001]
- $f(x,\ y)=x^2+y^2+|x|+10|y|$
- $f$: Convex and Differentiable $\rightarrow$ Always converges to a local minimizer
Sub-Gradient (Descent) Algorithm
- $u(x_0)$: Sub-gradient of $f$ at $x_0$ if $f(x)\geq f(x_0)+u(x_0)^T(x-x_0),\ \forall x\in B(x_0,\ \delta)$
- Sub-Gradient Descent Algorithm: $x_{n+1}=x_n-\alpha_nu_n$
- $u_n$: Sub-gradient at $x_n$
- $f$: Convex but may not differentiable
- Not a descent method
- Stopping rule by tracing $f^{best}_n=\min\lbrace f(x_1),\ …,\ f(x_n)\rbrace$
- Deterministic step size
- $\alpha_n$: Small constant, square summable but not summable ($1/n$), nonsummable diminishing ($1/\sqrt{n}$), …
Stochastic Gradient (Descent) Algorithm
- Minimizing Empirical Risk Function: $R_n(\beta)=\Sigma^n_{i=1}L(y_i,\ x_i^T\beta)$
- Batch: Whole Samples
- $\beta_{t+1}=\beta_t-\alpha_t\Sigma^n_{i=1}\nabla L(y_i,\ x_i^T\beta_t)$
- Stochastic: One Random Smaple $(y_i,\ x_i)$
- $\beta_{t+1}=\beta_t-\alpha_t\nabla L(y_i,\ x_i^T\beta_t)$
- Mini-Batch: A Number of Random Sample $(y_i,\ x_i),\ i\in S$
- $\beta_{t+1}=\beta_t-\alpha_t\Sigma_{i\in S}\nabla L(y_i,\ x_i^T\beta_t)$
- Batch: Whole Samples
Learning Methods
- Linear and Logistic Regression
- High-Dimensional Linear Regression
- Ridge
- LASSO
- SCAD
- Forward Greedy Methods
- Forward Selection
- Stagewise Regression
- LARS
- Support Vector Machine
- Artificial Neural Network
- Tree and Ensemble
- Bagging
- Random Forest
- Boosting
Linear and Logistic Regression
Linear Regression
- Regression Function and Prediction
- $f(x)=x^T\beta,\ y_x=f(x)=x^T\beta$
- Least Square Estimator, LSE: Empirical Risk Function with Square Loss
- $R_n(\beta)=\Sigma^n_{i=1}(y_i-x_i^T\beta)^2$
- Algorithm
- Direct Calculation
- $\hat\beta=(\Sigma^n_{i=1}x_ix_i^T)^{-1}\Sigma^n_{i=1}x_iy_i$
- Gradient Descent Algorithm
- $\beta_{t+1}=\beta_t+\alpha_t\Sigma^n_{i=1}x_i(y_i-x_i^T\beta_t)$
- Inner Cycling for Coordinate Descent
- $\beta_{(t+1)j}=\Sigma^n_{i=1}x_{ij}(y_i-\Sigma_{k\neq j}x_{ik}\tilde{\beta}_{tk}/\Sigma^n_{i=1}x^2_{ij})$
- Direct Calculation
Other Empirical Risk Functions
- Generalized Least Square Estimation
- $R_n(\beta)=\Sigma_{i,\ j}(y_i-x_i^T\beta)\Omega_{ij}(y_j-x_j^T\beta)$
- Quantile Regression
- $R_n(\beta)=\Sigma^n_{i=1}\rho_\tau(y_i-x_i^T\beta)$
- Linear Programming
- Sub-Gradient Descent
- Regularization (Penalization) via $J_\lambda$
- $R_n^\lambda(\beta)=R_n(\beta)+\Sigma^p_{i=1}J_\lambda(|\beta_j|),\ \lambda>0$
- Coordinate Descent Algorithm
- Sub-Gradient Descent
- Convex Concave Procedure
Logistic Regression
- Regression Function and Prediction
- $f(x)=x^T\beta,\ y_x=\phi(x)=I(f(x)>0)$
- Empirical Risk Function
- $R_n(\beta)=\Sigma^n_{i=1}(-y_ix_i^T\beta+log(1+exp(x_i^T\beta)))$
- Algorithm
- Gradient Descent Algorithm
- $\beta_{t+1}=\beta_t-(\Sigma^n_{i=1}x_ip_{ti}(1-p_{ti})x^T_i)^{-1}\Sigma^n_{i=1}(p_{ti}-y_i)x_i$
- Gradient Descent Algorithm
Statistical Interpretation
- Maximum Likelihood Estimation
- Empirical Risk Function: Negative Log-Likelihood Function
- Linear Regression
- Normal Distribution Assumption with Known Variance
- $y_i|x_i\sim N(f(x_i),\ \sigma^2),\ i\leq n$
- Logistic Regression
- Bernoulli Distribution Assumption
- $y_i|x_i\sim B(1,\ exp(f(x))/(1+exp(f(x)))),\ i\leq n$
- Another Equivalent Form for $y\in \lbrace-1,\ 1\rbrace$
- Empirical Risk Function with Logistic Loss
- $R_n(\beta)=\Sigma^n_{i=1}log(1+exp(-y_ix_i^T\beta))$
High-Dimensional Linear Regression
Penalized Estimation
- Penalized Empirical Risk Functrion
- $R_n^\lambda(f)=R_n(f)+J^\lambda(f)$
- $R_n$: Empirical Risk Function
- $f\in\mathcal{F}$: Regression Fuction
- $J_\lambda$: Penalty Function
- $\lambda$: Tuning Parameter (Vector)
- Objectives and Characteristics
- Regularization
- Variable Selection
- High-Dimensional Data Analysis
- Structure Estimation
Variable Selection in Linear Regression
- Penalized Empirical Risk Function for The Square Loss
- $R_n^\lambda(\beta)=\Sigma^n_{i=1}(y_i-x_i^T\beta)+\Sigma^p_{j=1}J^\lambda(|\beta_j|)$
- Algorithm: Penalty Functions
- Ridge: $J^R_\lambda(|\beta|)=\lambda\beta^2$
- LASSO: $J^L_\lambda(|\beta|)=\lambda|\beta|$
- Bridge: $J^B_\lambda(|\beta|)=\lambda|\beta|^\nu,\ \nu>0$
- SCAD: $\nabla\lambda^S_{\lambda,\ a}(|\beta|)=\lambda I(|\beta|<\lambda)+(\lambda-a|\beta|)+a/(a-1)I(|\beta|\geq\lambda)$
- MC: $\nabla J^M_{\lambda,\ a}(\beta)=(\lambda-\beta/a)_+$
- Elastic Net: $J^E_{\lambda,\ \gamma}(\beta)=\lambda\beta+\gamma\beta^2$
Ridge
- Example) Simple Linear Regression
- $\begin{equation}\left\lbrace\begin{gathered}R_n(\beta)=\Sigma^n_{i=1}(y_i-x_i\beta)^2=(\beta-3)^2/2\newline R^\lambda_n(\beta)=(\beta-3)^2/2+\lambda\beta^2\end{gathered}\notag\right.\end{equation}$
- Least Square Estimator
- $\tilde{\beta}=arg\min_\beta R_n(\beta)=3$
- Ridge [Hoerl and Knnard, 1970]
- $\tilde{\beta}^R=arg\min_\beta R^\lambda_n(\beta)=1/(1+2\lambda)\tilde{\beta}$
- Results
- Shrinkage or Biased: $|\hat\beta^R|<|\tilde{\beta}|$ for any $\lambda>0$
- Better Mean Square Error Under Multicollinearity
LASSO
- (Cont.) Simple Linear Regression
- $R^\lambda_n(\beta)=(\beta-3)^2/2+\lambda|\beta|$
- Least Square Estimator
- $\tilde{\beta}=arg\min_\beta R_n(\beta)=3$
- Least Absolute Shrinkage and Selection Operator (LASSO) [Tibshirani, 1996]
- Results
- Shrinkage or Biased: $|\hat\beta^L|<|\tilde{\beta}|$ for any $\lambda>0$
- Sparsity or Variable Selection: $\hat\beta^L=0$ or $\hat\beta^L\neq0$ for some $\lambda$
SCAD
- Smoothly Clipped Absolute Deviation (SCAD) [Fan and Li, 2001]
- $R_n^\lambda=(\beta-3)^2+J_\lambda(|\beta|)$
- Solution
- $\begin{equation}\hat\beta^2=arg\min_\beta R^\lambda_n(\beta)=\left\lbrace\begin{aligned}0,\ &\ |\tilde{\beta}|<\lambda\newline u(\tilde{\beta},\ \lambda),\ &\ \lambda\leq|\tilde{\beta}|<a\lambda\newline\tilde{\beta},\ &\ |\tilde{\beta}|\geq a\lambda\end{aligned}\right.\notag\end{equation}$
- Results
- No Shrinkage or Unbiased: $\hat\beta^S=\tilde{\beta}$ for any $\lambda\leq|\tilde{\beta}|/a$
- Sparsity: $\hat\beta^S=0$ or $\hat\beta^S\neq0$ for some $\lambda$
Coordinate Descent Algorithm
- LSE
- Problem: Minimize $R_n(\beta)=\Sigma^n_{i=1}(y_i-x_i^T\beta)^2/2$
- Coordinate Function of $\beta_j$
- $R_{nj}(\beta_j)\propto\lbrace\Sigma^n_{i=1}x_{ij}^2/2\rbrace\beta_j^2-\lbrace\Sigma^n_{i=1}x_{ij}(y_i-\Sigma_{k\neq j}x_{ik}\beta_k)\rbrace\beta_j$
- Coordinate Descent Algorithm
- $\tilde{\beta}_{(t+1)j}=\lbrace\Sigma^n_{i=1}x_{ij}(y_i-\Sigma_{k\neq j}x_{ik}\tilde{\beta}_{tk})\rbrace/\lbrace\Sigma^n_{i=1}x_{ij}^2\rbrace$
- LASSO
- Problem: Minimize $R^\lambda_n(\beta)=R_n(\beta)+\lambda\Sigma^p_{j=1}|\beta_j|$
- Coordinate Function of $\beta_j$
- $R_{nj}^\lambda(\beta_j)\propto R_{nj}(\beta_j)+\lambda|\beta_j|$
- Coordinate Descent Algorithm [Friedman et al., 2007]
- $\beta_{(t+1)j}=soft(\tilde{\beta}_{(t+1)j},\ \lambda/\Sigma^n_{i=1}x_{ij}^2)$
- Soft-Thresholding Operator
- $soft(x,\ \lambda)=arg\min_\beta\lbrace(\beta-x)^2/2+\lambda|\beta|\rbrace=sign(x)(|x|-\lambda)_+$
- SCAD
- Problem: Minimize $R_n^\lambda(\beta)=R_n(\beta)+\Sigma^p_{j=1}J_S^\lambda(|\beta_j|)$
- CCCP + CD Algorithm [Kim et al., 2008], [Lee et al., 2016]
- $\beta_{t+1}=arg\min_\beta U^{\lambda}_n(\beta|\beta_t)$
- Conbex-Concave Decomposition of The Penalty
- $J_S^\lambda(|\beta_j|)=\lambda|\beta_j|+(J_S^\lambda|\beta_j|-\lambda|\beta_j|)$
- Upper Tight Convex Function of $R_n^\lambda$ at $\beta^c$
- $U^\lambda_n(\beta|\beta^c)=R_n(\beta)+\Sigma^p_{j=1}\lbrace(\nabla J^\lambda_S(|\beta^c_j|)-\lambda sign(\beta^c_j))\beta_j+\lambda|\beta_j|\rbrace$
Applications
- High Dimensional Data Analysis
- Number of Parameters $\geq$ Number of Samples
- Allow Various Over-Parameterized Models
Regularization
orStabilization
- Applied together with almost all learning methods
- Prevents overfit
- Structure estimation
- Structure Construction
- LASSO: $\lambda\Sigma^p_{j=1}|\beta_j|\leftrightarrow\hat\beta_j=0$ or not
- Fused LASSO [Tibshirani et al., 2005]
- $\lambda_1\Sigma^p_{j=1}|\beta_j|+\lambda_2\Sigma^p_{j=2}|\beta_j-\beta_{j-1}|$
- $\lambda_1$: $\hat\beta_j=0$
- $\lambda_2$: $\hat\beta_k=\hat\beta_{k+1}=…=\hat\beta_{k’}\neq0$
- $\lambda_1\Sigma^p_{j=1}|\beta_j|+\lambda_2\Sigma^p_{j=2}|\beta_j-\beta_{j-1}|$
- Group LASSO [Yuan and Lin, 2006]
- $\lambda\Sigma^K_{k=1}|\sqrt{\Sigma^{p_k}_{k=1}\beta^2_{kj}}$
- Strong Heredity Interaction LASSO [Choi et al.m 2010]
- $\lambda\Sigma^p_{j=1}|\beta_j|+\lambda_2\Sigma_{j<k}|\eta_{kj}|,\ \eta_{jk}=\delta_{jk}\beta_j\beta_k$
- $\lambda_1$: $\beta_j=0$
- $\lambda_2$: $\beta_j=0$ or $\beta_k=0\rightarrow\eta_{jk}=0$
- $\lambda_3$: $\beta_j\neq0$ and $\beta_k\neq0\rightarrow\eta_{jk}\neq0$
- $\lambda\Sigma^p_{j=1}|\beta_j|+\lambda_2\Sigma_{j<k}|\eta_{kj}|,\ \eta_{jk}=\delta_{jk}\beta_j\beta_k$
Forward Greedy Methods
Forward Selection in Linear Regression
- Algorithm: Residual-Based Forward Selection
- Let $\beta=0$
- For $t=1,\ 2,\ …$ repeat the followings
- Calculate residuals $r_i=y_i-x_i^T\beta^t,\ \ i\leq n$
- Find $k=arg\min_j\Sigma_i(r_i-x_{ij}\beta_j)^2$
- Find $\beta^{t+1}=arg\min_{\beta^t,\ \beta_k}\Sigma_i(y_i-\Sigma_Ix_{iI}\beta_I^t-x_{ik}\beta_k)^2$
- Intuition and Result
- It iteratively includes the variable $x_k$ that is the most correlated with current residual
- Sequence of Least Square Estimators: $\beta^0\rightarrow\beta^1\rightarrow\beta^2\rightarrow…$
Forward Stagewise Regression
- Originality
- Stepwise Least Squares [Goldberger, 1961]
- Residual Analysis [Freund et al., 1961]
- Algorithm
- Let $\beta=0$ and fix $\lambda^{FSW}>0$ as a small constant
- For $t=1,\ 2,\ …$ repeat the followings
- Calculate residuals $r_i=y_i-x_i^T\beta^t,\ i\leq n$
- Find $(k,\ \beta_k)=arg\min_{j,\ \beta_j}\Sigma_i(r_i-x_{ij}\beta_j)^2$
- Update $\beta^{t+1}=\beta^t+\lambda^{FSW}sign(\beta_k)$
- Result
- We get a long sequence of regularized estimators because of the choise $\lambda^{FSW}$
Least Angle Regression
- Least Angle Regression (LARS) [Efron et al., 2004]
- The same as the stragewise regression except that $\beta^{t+1}=\beta^t+\lambda^{LARS}sign(\beta_k)$
- $\lambda^{LARS}$ makes the new residual equally correlated with all the input variables included
- $\lambda^{LARS}$ has easy closed from w.r.t. current $\beta^t$
- The same as the stragewise regression except that $\beta^{t+1}=\beta^t+\lambda^{LARS}sign(\beta_k)$
- Some Connection
- LARS-LASSO: LARS + Deletion Step = LASSO
- LARS-Stagewise: Stagewise with $\lambda^{FSW}\rightarrow0=$LARS
Support Vector Machine
- Support Vector Machine (SVM) for Binary Classification [Vapnik, 1995]
- Regression Function and Prediction
- $f(x)=b+x^Tw,\ \ y_x=\phi(x)=sign(f(x))$
- Pernalized Problem: Pernalized Empirical Risk Function
- $R_n^\lambda(\beta)=\Sigma_{i=1}^n\lbrace1-y_if(x_i)\rbrace_++\lambda||w||^2,\ \lambda>0$
- Hinge Loss + Ridge Penalty
- $x_+=\max\lbrace0,\ x\rbrace$
- Stochastic: Sub-Gradient Algorithm [Shalev-Shwartz et al., 2007]
- $w_{t+1}=w_t+\alpha_t\lbrace-\Sigma^n_{i=1}y_ix_iI(y_if(x_i)<1)+2\lambda w_t\rbrace$
- Regression Function and Prediction
Maximizing Margin
- Primal Problem: Minimizing $||w||^2/2+\gamma\Sigma^n_{i=1}\xi_i,\ \gamma>0$ subject to
$y_if(x_i)\geq1-\xi_i,\ \xi_i\geq0,\ i\leq n$- $\xi_i$: Slack Variable for Quantifying Degree of Missclassification
- $x_i$ with $\xi_i=0$: Support Vector
- $x_i$ with $0<\xi_i\leq1$: Correctly Classified Sample
- $x_i$ with $1<\xi_i$: Incorrectly Classified Sample
- $2/||w||$ (Margin): Distance between $f$ and support vector
- $\xi_i$: Slack Variable for Quantifying Degree of Missclassification
- Equivalence between primal and penalized problem
$y_if(x_i)\geq1-\xi_i,\ \xi_i\geq0\leftrightarrow\xi_i=\max\lbrace0,\ 1-y_if(x_i)\rbrace$
KKT Conditions
- Lagrange Primal Function with Multipliers:
$||w||^2/2+\gamma\Sigma^n_{i=1}\xi_i-\Sigma^n_{i=1}\alpha_i\lbrace y_if(x_i)-(1-\xi_i)\rbrace-\Sigma^n_{i=1}\mu_i\xi_i$- Stationarity: $w=\Sigma^n_{i=1}\alpha_iy_ix_i,\ \Sigma^n_{i=1}\alpha_iy_i=0,\ \gamma=\alpha_i+\mu_i,\ i\leq n$
- Primal and Dual Feasibility: $y_if(x_i)-(1-\xi_i)\geq0,\ \xi_i\geq0,\ i\leq n$
- Complementary Slackness: $\alpha_i\lbrace y_if(x_i)-(1-\xi_i)\rbrace=0,\ \mu_i\xi_i=0,\ i\leq n$
- Dual Problem: Minimizing lagrange dual objective function
$\Sigma^n_{i=1}\Sigma^n_{j=1}y_iy_j\alpha_i\alpha_j\langle x_i,\ x_j\rangle-2\Sigma^n_{i=1}\alpha_i$ subject to $\Sigma^n_{i=1}\alpha_iy_i=0,\ -\leq\alpha_i\leq\gamma,\ i=1,\ …,\ n$- Quadratic Programming (QP)
- CD Algorithm [Sequential Minimal Optimization, Platt, 1998]
- Relation between $\alpha_i$ and $\alpha_j,\ i\neq j$
- $y_i\alpha_i+y_j\alpha_j=-\Sigma_{k\neq i,\ j}y_k\alpha_k=c$
- Univariate optimization from thje inequality $\alpha_i=(c-y_j\alpha_j)y_i$ under constraints $L\leq\alpha_j\leq H$ for some $0\leq L<H\leq\gamma$
- Relation between $\alpha_i$ and $\alpha_j,\ i\neq j$
- Slope: Linear Combination of Samples (Stationarity)
- $w=\Sigma^n_{i=1}\alpha_iy_ix_i$
- Implicity of Slope
- $f(x)=x^Tw+b=\Sigma^n_{i=1}\alpha_iy_ix_i^Tx+b$
- Inner products are required only
- Classification depends on non-zero $a_i$s, i.e., support vector $x_i$s (Sparsity)
Non-Linear Regression Function
- Linear Solution from Optimization
- $f(x)=\Sigma^n_{i=1}\alpha_iy_i\langle x_i,\ x\rangle+b$
- Non-Linear Feature Transformation
- $x\rightarrow\Phi(x),\ f(x)=\Sigma^n_{i=1}\alpha_iy_i<\Phi(x_i),\ \Phi(x)>+b$
- Kernel Trick: Kernel Function $K$
- $K(x,\ x’)=\langle\Phi(x_i),\ \Phi(x’)\rangle\rightarrow f(x)=\Sigma^n_{i=1}\alpha_iy_iK(x_i,\ x)$
- Linear Kernel: $K(x,\ x’)=\langle x,\ x’\rangle$
- D-th Order Polynomial Kernel: $K(x,\ x’)=(\langle x,\ x’\rangle+I)^d$
- Gaussian Kernel: $K(x,\ x’)=exp(-||x-x’||^2/\sigma^2)$
- Sigmoid Kernel: $K(x,\ x’)=tanh(\kappa_1\langle x,\ x’\rangle+\kappa_2)$
- $K(x,\ x’)=\langle\Phi(x_i),\ \Phi(x’)\rangle\rightarrow f(x)=\Sigma^n_{i=1}\alpha_iy_iK(x_i,\ x)$
- SVM as a penalized estimation $R_n^\lambda(\beta)=\Sigma^n_{i=1}(1-y_if(x_i))_++\lambda||f||^2_{\mathcal{H}_K},\ \lambda>0$
- $\mathcal{H}_K$: Reproducing kernel Hilbert space w.r.t. a kernel $K$
- Representation Theorem: The minimizer of $R_n$ is represented as $f(x)=\Sigma^n_{i=1}\alpha_iy_iK(x_i,\ x)$
- Example)
- $\begin{equation}\left\lbrace\begin{gathered} K(x,\ x’)=\langle\Phi(x),\ \Phi(x’)\rangle\newline\mathcal{H}_K=\lbrace\Phi(x)^Tw+b|w\in\mathbb{R}^q,\ b\in\mathbb{R}\rbrace\end{gathered}\right.\notag\rightarrow||f_{\mathcal(H)_K}||^2=||w||^2\end{equation}$
- Example)
Artificial Neural Network
- Regression function with one hidden layer
- $y\in\mathcal{Y}\subset\mathbb{R}^k,\ f(x)=g(a(h(x)))$
- $h$: Non-linear transformation from $x$ to hidden units via an activation function $\sigma$
- $h(x)=(h_1(x),\ …,\ h_m(x))^T,\ h_j(x)=\sigma(x^Tw_j+b_j),\ j\leq m$
- Sigmoid: $\sigma(t)=1/(1+exp(-t))$
- ReLu: $\sigma(t)=max\lbrace0,\ t\rbrace$
- RBF: $\sigma(t)=exp(-v^2/2)$
- $a$: Linear combinations of hidden units
- $a(h)=(a_1(h),\ …,\ a_k(h))^T,\ a_s(h)=h^Tv_s+c_s,\ s\leq k$
- $g$: Transformation from $a$ to outputs via output functions $g_s,\ s\leq k$
- $g(a)=(g_1(a),\ …,\ g_k(a))^T$
- Regression, Identity: $g_s(a)=a_s$
- Classification, Softmax: $g_s(a)=exp(a_s)/\Sigma^k_{I=1}exp(a_I)$
Empirical Risk Function
- $R_n(\theta)=\Sigma^n_{i=1}L(y_i,\ f(x_i))$
- $(p+1)\times m+(m+1)\times k$ Parameters
- $\theta=(W,\ b,\ V,\ c)$
- $W=(w_1,\ …,\ w_m)^T,\ b=(b_1,\ …,\ b_m)^T$
- $V=(v_1,\ …,\ v_k)^T,\ c=(c_1,\ …,\ c_k)^T$
- Loss Function
- Regression: $L(y,\ f(x))=\Sigma^k_{s=1}(y_s-f_s(x))^2$
- Classification: $L(y,\ f(x))=-\Sigma^k_{s=1}\lbrace y_slog(f_s(x))\rbrace$
- Non-Convex (Non-Differentiable) Optimization
- (Sub) Gradient Descent Algorithm
- $\theta_{t+1}=\theta_t-\alpha_t\partial R_n(\theta_t)/\partial\theta$
Gradient Calculation (Backpropagation)
- Risk Function Structure
- $R=L(y,\ f)$
- $f=g(a)$
- $a=Vh+b$
- $h=(\sigma(x^Tw_j+b_j),\ j\leq m)^T$
- Chain Rule w.r.t. $v_{sl}$
- $\partial R/\partial v_{sl}=\nabla_fL\times\nabla_ag\times\nabla_{v_{sl}}a=\lbrace\nabla_fL\times\nabla_{a_s}g\rbrace\times h_I$
- Chain Rule w.r.t. $w_{jt}$
- $\partial R/\partial w_{jt}=\nabla_fL\times\Sigma^k_{s=1}\nabla_{a_s}g\times v_{sj}\times\nabla_{w_{jt}}h_j$
$=\lbrace\Sigma^k_{s=1}\lbrace\nabla_fL\times\nabla_{a_s}g\rbrace\times v_{sj}\times\sigma’(x^Tw_j+b_j)\rbrace x_t$
- $\partial R/\partial w_{jt}=\nabla_fL\times\Sigma^k_{s=1}\nabla_{a_s}g\times v_{sj}\times\nabla_{w_{jt}}h_j$
Tree and Ensemble
Decision Tree
- Regression Function
- $f(x)=\Sigma^M_{m=1}f_m(x)I[x\in R_m]$
- $M$: Numbers of Rectangles
- $R_m,\ m\leq M$: Disjoint Rectangular Areas in $\mathbb{R}^p$ (Terminal Nodes)
- $f_m(x),\ m\leq M$: Local Regression Functions of Interest
- Empirical Risk Function with A Loss $L$
- $R_n(f_m,\ R_m,\ m\leq M)=\Sigma^n_{i=1}L(y_u,\ f(x_i))=\Sigma^M_{m=1}\Sigma_{x_i\in R_m}L(y_i,\ f_m(x_i))$
- Top-Down Induction Algorithm
- Set an initial Tree $f=(f_m,\ R_m,\ m\leq M)$ and repeat the followings
- Given a tree $f$, construct $\mathcal{F}_f$ that is the set of all possible trees that can be constructed from $f$
- Find a tree in $\mathcal{F}_f$ that minimizes the empirical risk
- Stop if there is no gain in emprical risk or some stoping rules are satisfied
- The Number of Areas, The Largest Area, Depth, …
Growing Regression Tree by CART
- Regression Function
- Local Constant: $f_m(x)=c_m$
- Local Linear: $f_m(x)=x^T\beta_m$
- Given a tree $f^t=(f_m^t,\ R_m^t,\ m\leq M^t)$
- Find the set of all possible trees,
say $\mathcal{F}^t$, by splitting $R_m^t,\ m\leq M^t$
once: $R_m^t\rightarrow R_m^t\cap[x_j\leq c]\cup R_m^t\cap [x_j>c]$ - Find $f^{t+1}\in\mathcal{F}^t$ by minimizing $R_n(f_m,\ R_m,\ m\leq M)=\Sigma^n_{i=1}L(y_i,\ f(x_i))$
- Find the set of all possible trees,
- Local estimation of $f$ with square loss
- $L(y,\ f(x))=\Sigma^M_{m=1}(y-f_m(x))^2/[x\in R_m]$
- Local Constant: $\hat c_m=\bar y_m$
- Local Linear: $\hat{\beta}_m=(X_m^TX_m)^{-1}X_m^Ty_m$
- $X_m$ and $y_m$: The design matrix and target vector in $R^m$
- Regression Function: $f(x)=\Sigma^M_{m=1}f_m(x)I[x\in R_m]$
- Voting: $f_m(x)=arg\max_k P(y=k|x\in R_m)$
- $f^{t+1}\in\mathcal{F}^t$ by minimizing: $R_n(f_m,\ R_m,\ m\leq M)=\Sigma^n_{i=1}L(y_i,\ f(x_i))$
- Impurity Measure (Loss): $L(y,\ f(x))=\Sigma^M_{m=1}imp_m(y,\ f_m(x))P(x\in R_m)$
- Impurity Index: $imp_m(y,\ f_m(x))$
- Miss-Classification Index: $1-\max_kP(y=k|x\in R_m)$
- Gini Index: $1-\Sigma_kP(y=k|x\in R_m)^2$
- Entropy Index: $-\Sigma_kP(y=k|x\in R_m)log_2P(y=k|x\in R_m)$
- Local Estimation of $f$
- $\hat P(x_i\in R_m)=\Sigma_iI(x_i\in R_m)/n$
- $\hat f_m(x)=arg\max_k\Sigma_iI(y_i=k,\ x_i\in R_m)/\Sigma_iI(x_i\in R_m)$
ECCP
- Pruning a tree
- Too simple regression tree may be easy to interpret but fails to have good prediction accuracy, and too complex regression tree will overfit the samples
- Hence, pruning process is required
- One popular measure for pruning is the Empirical Cost-Complexity Pruning (ECCP)
- ECCP
- Let $f$ be the tree after the growing and stopping steps
- Given $\alpha_0$, we find the minimizer $\hat{f}_\alpha\in\mathcal{F}$ of the ECCP,
$C_{\alpha,\ n}(f_=\Sigma^M_{m=1}\Sigma_{x_i\in R_m}(y_i-f_m(x_i))^2+\alpha M$- $\mathcal{F}$: Set of all trees obtained by pruning $f$ once
- To determine $\alpha$ we may use some methods such as test error or cross validation
- Then the final empirical tree (estimator) $\hat{f}$ is the tree after growing, stopping and pruning
- Pros
- Simple interpretation
- Easy handling of continuous and categorical inputs
- Robust to input outliers
- Model-free
- Cons
- Low prediction accuracy
- Bad interpretation of large tree
- Unstable or high variance
- Bad estimation of non-squared areas
Ensemble Methods
Ensemble method is a supervised learning algorithm (technique) for combining many predictive models $\hat{f}_k,\ k\leq K$ (weak learners trained individually) into one final predictive model $\hat{f}_{ens}$ (strong learner) in the following way $\hat{f}_{env}=\Sigma^K_{k=1}w_k\hat{f}_k$ giving higher prediction accuracy
- The strong learner is known to over-fit the training data more than weak learners theoretically
- Some ensemble techniques such as the bagging tends to reduce problems related to over-fitting practically
- Bagging procedure turns out to be a variance reduction scheme, at least for some base procedures
- Boosting methods are primarily reducing the (model) bias of the base procedure
- Random forest is a very different ensemble method than bagging or boosting. From the perspective of prediction, random forests is about as good as boosting, and often better than bagging.
Bootstrap Samples
- Roughly speaking, the bootstrapping is a technique of copying the triaining samples many times so that we can enrich the samples
- The samples from the bootstrapping are called bootstrap samples
- There are many bootstrapping schemes but, we only consider the bootstrap samples
- Let $(y_1,\ x_1),\ …,\ (y_n,\ x_n)$ be $n$ training samples
- Randomly draw a sample from the training samples $m$ times with replacement, and then we have $(y_1^,\ x_1^),\ …,\ (y_m^,\ x_m^)$, the $m$ bootstrap samples
- We often consider $m=n$, which is like sampling an entirely new training set
- Not all of the training samples are included in the bootstrap samples, and are included more than oce
- When $m=n$, about 36.8% of the training samples are left out, for large $n$
Bagging
- Bagging is the first ensemble method [Breiman, 1996]
- Bagging is an abbreviation of bootstrap aggregating
- It stabilizes give predictive model by unifying models constructed with bootstrapped samples, enhancing unstable methods to have higher prediction accuracy
- Unstable: Slightly perturbing the training set can cause signifiant changes in the estimated model
- Among Leaarning Methods
- One of the most stable methods is the 1-nearest neighbor method
- One of the most unstable methods is regression tree
- Least square estimation is less stable than Huber’s estimation
- Subset selection method is quite unstable
- Bagging steps: Bagging constructs a unified predictive model as follows
- Obtain $B$ bootstrap samples of size $m$, $(y_{b1}^,\ x_{b1}^),\ …,\ (y_{bm}^,\ y_{bm}^,\ b\leq B$ from the training samples, $(y_1,\ x_1),\ …,\ (y_n,\ x_n)$ of size $n$
- Construct $B$ weak predictive models (individually) $\hat{f}_b,\ b\leq B$ from the $B$ bootstrap samples
- Unify the models into one predictive model $\hat{f}_{bag}$
- Averaging for regression: $\hat{f}_{bag}=\Sigma^B_{b=1}\hat{f}_b/B$
- Voting for classification: $\hat{f}_{bab}=arg\max_k\Sigma^B_{b=1}I(\hat{f}_b=k)/B$
- Breiman [1996] described heuristically the performance of bagging
- The model variance of the bagged estimator should be equal or smaller than that of the original estimator
- There can be a drastic variance reduction if the original estimator is unstable
- Tibshirani (CMU lecture note) discussed some disadvantages of bagging:
- Loss of interpretability: The final bagged classifier from trees is not a tree, and so we forfeit the clear interpretative ability of a classification tree
- Computational complexity: We are essentially multiplying the work of growing a single tree by B (especially if we are using the more involved implementation that prunes and validates on the original training data)
- An example from Ryan Tibshirani (Wisdom of crowds)
- Here is a simplified setup with $K=2$ classes to help understand the basic phenomenon
- Suppose that for a given $x$ we have $B$ independent classifies $\hat{f}_b,\ b\leq B$
- Assume each classifier has a miss-classification rate of $e=0.4$
- Assume w.l.o.g. that the true class at $x$ is 1 so that $P(\hat{f}_b(x)=2)=0.4$
- Now, we form the bagged classifier $\hat{f}_{bag}=arg\max_k\Sigma^B_{b=1}I(\hat{f}_b=k)/B$
- Let $B_k=\Sigma^B_{b=1}I(\hat{f}_b=k)$ be the number of votes for class $k$
- Notice that $B_2$ is a binomial random variable: $B_2\sim B(B,\ 0.4)$
- Hence, the miss-classification rate of the bagged classifier is
$E_2=P(\hat{f}_{bag}(x)=2)=P(B_2>B_1)=P(B_2\geq B/2)$ - Easy to see $E_2\rightarrow0$ as $B\rightarrow\infty$, in other words, $\hat{f}_{bag}$ has perfect predictive accuracy as $B\rightarrow\infty$
- What is the problem in this exaple?
- Of course, the caveat here is independence
- The classifiers that we use in practice $\hat{f}_b,\ b\leq B$ are clearly not independent, because they are on very similar data sets (Bootstrap samples from the training set)
- When will Bagging fail?
- Bagging a good classifier can improve predictive accuracy
- Bagging a bad one can seriously degrade accuracy
Random Forest
Random Forest [Breiman, 2001] is a way to create a final prediction model by using many decision trees randomly constructed
- It is a bagging the enforces more randomness
- Better prediction than bagging
- Robust to input outliers
- Easy computation than orignal tree: No pruning, Parallel computing, …
- Bagging from trees has a problem
- The trees produced from bootstrap samples can be highly correlated
- $X_i\sim(0,\ \sigma^2),\ Cor(X_i,\ X_j)=\rho,\ i,\ j\leq B\rightarrow Var(\bar{X}=\rho\sigma^2+(1-\rho)\sigma^2/B$
- The trees produced from bootstrap samples can be highly correlated
- Random Forest vs. Bagging Trees
- Constructing random forest
- Fit a decision tree to different bootstrap samples
- When growing the tree, use $q<p$ inputs randomly selected in each step
- Average the trees
- constructing bagging from trees
- Fit a decision tree to different bootstrap samples
- When growing the tree, use $p$ inputs in each step
- Average the trees
- Constructing random forest
- Boosting
- Boosting is similar to bagging in that we combine many weak predictive models
- But, boosting is quite different to bagging and sometimes can work much better
- We can see that boosting uses the whole training samples but adapts weights on the training samples
- The boosting is developed to minimize training error as fast as possible, and known to give higher prediction accuracy
- AdaBoost: Adaptive Boost
- The first boosting algorithm is the AdaBoost (Adaptive Boost) algorithm [Freund and Schapire, 1997]
- Initialize weights $w_i=1/n,\ i\leq n$
- For $m\leq M$, repeat followings
- Construct $\hat{f}_m(x)\in\lbrace-1,\ 1\rbrace$ using weights $w_i,\ i\leq n$
- Calculate $err_m=\Sigma^n_{i=1}w_iI(y_i\neq\hat{f}_m(x_i))/\Sigma^n_{i=1}w_i$
- Set $c_m=log((1-err_m)/err_m)$
- Update weights $w_i$ by $w_i=w_iexp(c_mI(y_i\neq\hat{f}_m(x_i)))$
- Construct the final model as $\hat{f}(x)=sign(\Sigma^M_{m=1}c_m\hat{f}_m(x))$
- The most interesting part in the AdaBoost algorithm is updating the weights
- If $err_m\leq 1/2$ then $exp(c_m)\geq1$
- If $y_i\neq\hat{f}_m(x_i)$ then $w_i$ increases, else does not change
- Larger weights on misclassified observations
- Forward Stagewise Regression with Square Loss
- Let $\beta^1=0$ and a small $\lambda>0$
- For $t=1,\ 2,\ …$ repeat the followings
- $(k,\ \beta_k)=arg\min_{j,\ \beta_j}\Sigma_i(y_i-\Sigma_Ix_{iI}\beta^t_I-x_{ij}\beta_j)^2$
- $\beta^{t+1}_s=\beta^t_s+\lambda sign(\beta_k)I(s=k)$
- Forward Stagewise Classification Tree with Exponential Loss
- Set $M$ trees, $T_1,\ …,\ T_M$
- Let $\beta^1=0$ and a small $\lambda>0$
- For $t=1,\ 2,\ …$ repeat the followings
- $(k,\ \beta_k)=arg\min_{j,\ \beta_j}\Sigma_iexp(-y_i\Sigma_IT_I(x_i)\beta^t_j-y_i\beta_jT_j(x_i))$
- $\beta^{t+1}_s=\beta^t_s+\lambda sign(\beta_k)I(s=k)$
- The first boosting algorithm is the AdaBoost (Adaptive Boost) algorithm [Freund and Schapire, 1997]
Model Comparison
Training-Validation Procedure
- Way of obtaining final model by selecting extra parameter, say $\lambda$, in single learning method based on given samples $\mathcal{S}$
- Step 1: Preparedifferent $\lambda_s$,say $\lambda_k$, $k\leq K$
- Step 2: Divide $\mathcal{S}$ into $\mathcal{S}_{tr}\cup\mathcal{S}_{vd}$ at random.
- Step 3 (Training): Learn models $f^{\lambda_k},\ k\leq K$ by using $\mathcal{S}_{tr}$
- Step 4 (Validation): Calculate prediction errors by using $\mathcal{S}_{vd}$
- $err(f^{\lambda_k}=\Sigma_{i\in\mathcal{S}_{vd}}L(y_i,\ f^{\lambda_k}(x_i)),\ k\leq K$
- Step 5: Select optimal $\lambda=\lambda^o$
- $o=arg\min_kerr(f^{\lambda_k})$
- Step 6: Learn final model $f^{\lambda_k}$ by using the whole $\mathcal{S}$
- Randomization
- You may repeat from Step 1 to Step 4 sufficiently many times obtaining averages of validation errors
- When the sample size is small try cross validation error
- Complexity Control
- The training-validation procedure can sometimes be a way of avoiding over-fit although not the best way: small training error but high test error
- Some learning methods have their own way or empirical experience of selecting extra parameters
Training-Validation-Test Procedure
- way of comparing two different learning methods, say $f$ and $g$ based on given samples $\mathcal{S}$
- Step 1: Divide $\mathcal{S}$ into $\mathcal{S}_{tr}\cup\mathcal{S}_{vd}\cup\mathcal{S}_{ts}$ at random
- Step 2: Do training-validation procedure based on $\mathcal{S}$ into $\mathcal{S}_{tr}\cup\mathcal{S}_{vd}$ for each method to obtain final models, say $f^o$ and $g^o$
- Step 3 (Test): Calculate prediction errors of $f^o$ and $g^o$ by using $\mathcal{S}_{st}$
- Step 4: The winner is who has smaller test error
- Randomization
- You may repeat from step 1 to step 4 sufficiently many times, obtaining averages of test errors