class: center, middle, inverse, title-slide

.title[
# Dimension Reduction: PCA
]
.author[
### Anastasios Panagiotelis
]
.institute[
### University of Sydney
]

---

# Outline

- What is PCA?

--

- Application of PCA

--

- Algebraic understanding

--

- Geometric understanding

--

- Latent factor model understanding

---
class: center, middle, inverse

# Principal Components Analysis

---

# Explaining Variance

- Let there be `\(n\)` observations of `\(p\)` variables; `\(x_{ij}\)` denotes observation `\(i\)` and variable `\(j\)`.

--

- Find some linear combination of the variables that has maximal variance.

--

- Find `\(w_1,w_2,\dots,w_p\)` such that

`$$y_i=w_1x_{i1}+w_2x_{i2}+\dots+w_px_{ip}$$`

has the biggest possible variance.

--

- This is the first principal component (PC).

---

# More PCs

- After finding the first principal component, we can look for a linear combination that

--

  + Has maximum variance
  + Is uncorrelated with the first PC

--

- This is called the second principal component.

--

- This continues until there are as many PCs as variables.

---

# No cheating...

- Arbitrarily big weights
--
 `\(\rightarrow\)` arbitrarily big variance.

--

  + Constrain `\(\sum w^2_j=1\)`

--

- Sensitive to units of measurement.

--

  + Center all variables by subtracting the mean.
  + Standardise all variables to have unit variance.

---
class: center, middle, inverse

# An application

---

# Implementation

R code to implement PCA for the World Bank data:

--


```r
library(tidyverse)
library(broom)
wb <- read_csv('../data/WorldBankClean.csv')
wb %>%
  select_if(., is.numeric) %>% # Use numeric data
  scale() %>%                  # Standardise
  prcomp() -> pca              # Compute PCs
wbPC <- augment(pca, wb)       # Add PCs to dataframe
```

---

# Explaining variance

- The variance of the first PC is 28.81.

--

  + This represents 44.32% of the total variance of the data.

--

- The variance of the second PC is 7.88.

--

  + This represents 12.12% of the total variance of the data.

--

- Together the first 5 PCs represent 77.48% of the total variance of the data.

---

# Scree plot

<img src="02PCA_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />

---

# Plot
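A minimal sketch of one way the first two PCs in `wbPC` could be plotted, assuming the `.fittedPC1` and `.fittedPC2` score columns created by `broom::augment()` (ggplot2 is loaded with the tidyverse above):


```r
# Sketch: scatterplot of the scores on the first two principal components.
# .fittedPC1 / .fittedPC2 are the score columns created by broom::augment().
wbPC %>%
  ggplot(aes(x = .fittedPC1, y = .fittedPC2)) +
  geom_point() +
  labs(x = "Principal Component 1", y = "Principal Component 2")
```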
---

# Uncovering Structure

- Countries towards the right tend to be more economically developed.

--

- Countries towards the bottom tend to be larger in population.

--

- Countries that are similar to one another are closer together on the plot.

--

- A small number of PCs explains a large proportion of the variance.

---
class: middle, center, inverse

# PCA: The Algebra

---

# PCA as optimisation

- The linear combination (LC) is given by `\(\by=\bX\bw\)`.

--

- Since the variables are centered, the variance of the LC is `\(\frac{1}{n-1}\sum_{i=1}^n y^2_i=\frac{1}{n-1}\by'\by\)`.

--

- The optimisation problem is

`$$\underset{\bw}{\max}\,\frac{1}{n-1}\bw'\bX'\bX\bw$$`

subject to `\(\bw'\bw=1\)`.

--

- Substitute `\(\bS=\frac{1}{n-1}\bX'\bX\)`.

---

# Solution

- The Lagrangian is

`$$\calL=\bw'\bS\bw-\lambda(\bw'\bw-1)$$`

--

- Setting the first order condition to zero,

`$$\frac{\partial\calL}{\partial{\bw}}=2\bS\bw-2\lambda\bw=\mathbf{0}$$`

--

- we need to find `\(\bw\)` satisfying

`$$\bS\bw=\lambda\bw$$`

---

# Eigenvalue Decomposition

- Solutions are given by the eigenvalue decomposition of `\(\bS\)`.

--

- There are multiple solutions. The eigenvector corresponding to the largest eigenvalue gives the weights of the first principal component.

--

- The eigenvector corresponding to the second largest eigenvalue gives the weights of the second principal component.

--

- And so on... (a numerical check appears in the appendix at the end of the deck).

---

# Data compression

- When the `\(\lambda_j\)` and `\(\bw_j\)` are the eigenvalues and eigenvectors,

`$$\bS=\sum_{j=1}^p \lambda_j\bw_j\bw_j'$$`

- This can be approximated by keeping only the `\(\color{blue}{m}<p\)` largest eigenvalues

`$$\bS\approx\sum_{j=1}^{\color{blue}{m}} \lambda_j\bw_j\bw_j'$$`

---
class: inverse, middle, center

# PCA: The geometry

---

# Rotations

- For symmetric positive semi-definite matrices, the matrix of eigenvectors `\(\bW\)` is a rotation matrix

--

  + Columns/rows are orthogonal
  + Columns/rows have unit length

--

- Multiplying a vector by a rotation matrix literally rotates that vector.

---

# Rotation is PCA

- The principal components are given by `\(\bY=\bX\bW\)`.

--

- Each observation (row of `\(\bX\)`) is rotated to the new components.

--

- This is best seen with a simple example.

---

# A simple case

<img src="02PCA_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />

IT.NET.USER.ZS = people using the internet; SH.ANM.NPRG.ZS = prevalence of anaemia among non-pregnant women.

---

# Components

<img src="02PCA_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />

---

# Animation

<img src="02PCA_files/figure-html/anim-.gif" style="display: block; margin: auto;" />

---

# Or as new coordinates

<img src="02PCA_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" />

---

# Or as new coordinates

<img src="02PCA_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" />

The first PC projects onto the orange line, the second PC onto the blue line.

---

# PCA and Factor Models

- Suppose the data are generated from the following statistical model

`$$\bx_i=\bA\by_i+\boldsymbol{\xi}_i$$`

--

- where

  + `\(\bx_i\)` is a `\(p\times 1\)` data vector,
  + `\(\by_i\)` is an `\(m\times 1\)` vector of latent factors,
  + `\(\bA\)` is a `\(p\times m\)` matrix of factor loadings,
  + `\(\boldsymbol{\xi}_i\)` is a `\(p\times 1\)` error vector.

--

- The `\(\by_i\)` can be estimated using PCs.

---

# Summary

- PCA can be thought of as:

--

  + Compressing the data with a matrix decomposition.
  + Rotating the data.
  + Constructing new coordinates.
  + Projecting onto a low-dimensional hyperplane.
  + A technique for estimating latent factors.

--

- All of these intuitions are useful.

---
class: inverse, center, middle

# Questions?
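---

# Appendix: PCA and the eigendecomposition in R

A minimal sketch checking the algebra section numerically: the eigenvalues of `\(\bS\)` should equal the PC variances, and the eigenvectors should match the `prcomp()` weights up to sign. It assumes the `wb` and `pca` objects created in the earlier chunk.


```r
# Sketch: eigen() on S = X'X/(n-1) should reproduce prcomp()'s output.
X <- wb %>% select_if(is.numeric) %>% scale()   # standardised data matrix
S <- cov(X)                                     # sample covariance of the scaled data
eig <- eigen(S)

all.equal(eig$values, pca$sdev^2)                       # eigenvalues = PC variances
all.equal(abs(eig$vectors), abs(unname(pca$rotation)))  # weights, up to sign
```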