class: center, middle, inverse, title-slide

.title[
# Dimension Reduction: Kernel PCA
]
.author[
### Anastasios Panagiotelis
]
.institute[
### University of Sydney
]

---

# Outline

- Feature mapping

--

  + Why we sometimes need to go to higher dimensions to get to lower dimensions.

--

- Kernel trick

--

  + How we can compute inner products of high- (even infinite-) dimensional transformations of the data using only the original data.

--

- Kernel PCA algorithm

--

  + Using this for PCA.

---
class: center, middle, inverse

# Feature Mapping

---

# Non-linear PCA

- Regular PCA gives principal components that are linear functions of the data.

--

- As a rough idea, consider including polynomials `\(x^2, x^3, x^4,\dots\)` as well as cross products.

--

- In general, let `\(\Phi:\bbR^p\rightarrow\bbR^P\)`, where `\(P>p\)`, be a feature map.

--

- Rather than carry out principal components on `\(\bx\)`, carry out principal components on `\(\Phi(\bx)\)`.

---

# Question

- Aren't we doing dimension reduction?

--

- Yes; however, the idea is that a linear algorithm (like PCA) can only reveal some patterns if we first transform the data to a higher dimension.

--

- Consider this example.

---

# Circles

<img src="04KPCA_files/figure-html/unnamed-chunk-1-1.png" style="display: block; margin: auto;" />

---

# Circles example

- Suppose we use PCA to go from 2 dimensions to 1 dimension.

--

- Since PCA can only rotate the data, there is no way to separate out the black and yellow points.

--

- This may be an important feature that we want to identify in the lower-dimensional space.

---

# First PC of circles example

<img src="04KPCA_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" />

---

# Add dimension

- Consider adding a new variable given by `\(x_3=x_1^2+x_2^2\)`.

--

- Then we carry out PCA on `\(x_1\)`, `\(x_2\)` and `\(x_3\)`.

--

- Will the first principal component separate out the black and yellow points?

---

# More dimensions
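
- A minimal sketch of this idea in R (the simulated data and variable names below are assumptions for illustration, not the code behind the figures on the neighbouring slides).


```r
# Sketch only: simulate two concentric circles, add the extra feature
# x3 = x1^2 + x2^2 and check whether the first PC separates the groups.
library(tidyverse)

set.seed(1)
n <- 200
circles <- tibble(
  group  = rep(c("inner", "outer"), each = n),
  radius = rep(c(1, 3), each = n) + rnorm(2 * n, sd = 0.1),
  angle  = runif(2 * n, 0, 2 * pi),
  x1     = radius * cos(angle),
  x2     = radius * sin(angle),
  x3     = x1^2 + x2^2                      # the added feature
)

pca <- prcomp(select(circles, x1, x2, x3))  # centered, unscaled PCA
ggplot(tibble(PC1 = pca$x[, 1], group = circles$group),
       aes(x = PC1, fill = group)) +
  geom_histogram(bins = 30)
```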
---

# First PC with extra feature

<img src="04KPCA_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />

---

# Problems

- How do we compute the feature space?

--

- Will this be computationally feasible?

--

- There is a trick we can use so that we never have to compute the feature space.

--

- This is known as the 'kernel' trick and is pervasive in machine learning.

---
class: inverse, middle, center

# The kernel trick

---

# The kernel trick

- Consider `\(\Phi(\bx)\)`, which is a `\(P\)`-dimensional vector.

--

- In many algorithms (including PCA), the solution can be reworked in terms of `\(\langle\Phi(\bx),\Phi(\bz)\rangle\)` (or, if you prefer, `\(\Phi(\bx)'\Phi(\bz)\)`).

--

- The inner product in the feature space can be written as a function of `\(\bx\)` and `\(\bz\)` known as the kernel

`$$\langle\Phi(\bx),\Phi(\bz)\rangle=K(\bx,\bz)$$`

---

# An example

- Consider the mapping `\(\Phi(\bx)=\left(x_1^2,\dots,x_p^2,\right.\)`
--
`\(\sqrt{2}x_1x_2,\dots,\sqrt{2}x_{p-1}x_p,\)`
--
`\(\sqrt{2}x_1,\dots,\sqrt{2}x_p,\)`
--
`\(\left.1\right)\)`

--

- We have gone from a `\(p\)`-dimensional vector to a `\(P=p(p+3)/2+1\)`-dimensional vector.

--

- The kernel function is

`$$K(\bx,\bz)=(\langle\bx,\bz\rangle+1)^2=\langle\Phi(\bx),\Phi(\bz)\rangle$$`

--

- This is called a polynomial kernel.

---

# Other kernels

- Other kernels are available:

--

  + RBF kernel
  + Hyperbolic tangent kernel

--

- These may provide the inner product for infinite-dimensional feature spaces.

--

- The feature space may not be unique.

---
class: inverse, middle, center

# Kernel PCA

---

# A PCA refresher

- The standard way to do PCA is to solve the eigenvalue problem

--

`$$\bS\bv=\lambda\bv$$`

--

- For `\(\bX\)` with centered columns, this is equivalent to

--

`$$\frac{1}{n-1}\bX'\bX\bv=\lambda\bv$$`

- The principal components are given by `\(\bX\bv\)` (not by `\(\bv\)`).

---

# A silly way to do PCA

- Take

--

`$$\frac{1}{n-1}\bX'\bX\bv=\lambda\bv$$`

--

- Pre-multiply by `\(\bX\)` and rearrange

--

`$$\bX\bX'\bX\bv=(n-1)\lambda\bX\bv$$`

- Replace `\(\bX\bv\)` with `\(\by\)` and `\((n-1)\lambda\)` with `\(\tilde{\lambda}\)`

--

`$$\bX\bX'\by=\tilde{\lambda}\by$$`

---

# Kernel matrix

- Principal components could be found from the eigenvectors of `\(\bX\bX'\)` instead of `\(\frac{1}{n-1}\bX'\bX\)`.

--

- Normally you wouldn't do this, since `\(\bX\bX'\)` is a bigger matrix than `\(\bX'\bX\)` whenever `\(n>p\)`.

--

- Unless...

--

rather than `\(\bX\)` with rows `\(\bx_i\)` you have a matrix with rows `\(\Phi(\bx_i)\)`.

--

  + Then `\(\bX\bX'\)` becomes `\(\bK\)`.

--

  + Here `\(\bK\)` is the **kernel matrix**, an `\(n\times n\)` matrix with elements `\(K(\bx_i,\bx_j)\)`.

---

# Some caveats

- This is a simpler treatment than you will find in many references.

--

- A more formal treatment has to deal more carefully with the possibly infinite dimensionality of the feature space.

--

- Care must be taken to center `\(\bK\)` by pre- and post-multiplying by the centering matrix `\(\bI-n^{-1}\iota\iota'\)`.

--

- Some sources will rescale the eigenvectors of `\(\bK\)` by `\(n-1\)` or `\(\lambda\)`.

---

# Demonstration

- Recall the World Bank data.

--

- This data was analysed earlier using standard PCA.

--

- A rough by-hand sketch of the kernel PCA computation is given on the next slide.

--

- After that, the PCA output will be shown again, and we will work through applying kernel PCA to the same dataset.
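
---

# Kernel PCA by hand

- A minimal sketch of the steps above on hypothetical toy data (the matrix `X`, the kernel choice and the scaling convention are assumptions for illustration; the demonstration that follows uses the `dimRed` package instead).


```r
# Sketch only: kernel PCA "by hand" on hypothetical simulated data
set.seed(1)
X <- matrix(rnorm(100 * 3), nrow = 100)    # 100 observations, 3 variables

# Kernel matrix using the polynomial kernel K(x, z) = (<x, z> + 1)^2
K <- (tcrossprod(X) + 1)^2                 # n x n

# Center K by pre- and post-multiplying by I - n^{-1} iota iota'
n <- nrow(K)
H <- diag(n) - matrix(1 / n, n, n)
Kc <- H %*% K %*% H

# Eigendecomposition of the centered kernel matrix
eig <- eigen(Kc, symmetric = TRUE)

# First two kernel principal components (up to the scaling conventions
# mentioned on the caveats slide)
kpc <- eig$vectors[, 1:2] %*% diag(sqrt(eig$values[1:2]))
head(kpc)
```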
---

# Standard PCA

<img src="04KPCA_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />

---

# The dimRed package

- The [`dimRed` package](https://github.com/gdkrmr/dimRed) provides a unified framework for many dimension reduction techniques.

--

- It uses an S4 class for handling data, with two slots:

--

  + The slot `data` contains the data (measures of development)
  + The slot `meta` contains other information (country names)

---

# Kernel PCA in dimRed

- By default, dimRed uses the radial basis function (RBF) kernel

--

`$$K(\bx,\bz)=\exp\left(-\frac{||\bx-\bz||^2}{2\sigma^2}\right)$$`

--

- The default value of the tuning parameter is `\(\sigma=0.1\)`.

---

# Code


```r
library(tidyverse)
library(dimRed)
read_csv('../data/WorldBankClean.csv')%>% #Read Data
  mutate_if(is.numeric,scale)%>% #Scale Data
  as.dimRedData(`Country Name` + `Country Code`~.,data=.)->wb #Convert to S4 class
kpcaout <- embed(.data = wb, .method="kPCA")
df<-tibble(cbind(kpcaout@data@meta,kpcaout@data@data)) # Convert back to a dataframe
ggplot(df,aes(x=kPCA1,y=kPCA2,label=`Country Code`))+geom_text(size=2) #Plot
```

---

# Kernel PCA

<img src="04KPCA_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" />

---

# With a different tuning parameter


```r
kpcaout <- embed(.data = wb, .method="kPCA",kpar=list(sigma=0.001))
```

<img src="04KPCA_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" />

---

# With a different kernel


```r
kpcaout <- embed(.data = wb, .method="kPCA",kernel='tanhdot',kpar=list(scale=1))
```

<img src="04KPCA_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" />

---

# Image example

- Read in PNG files of images of the letter 'A', rotated and rescaled.

--

- There are 124848 variables.

--

- Only 10898 of these variables have any variation across the images.

--

- Run PCA and kPCA only on these variables.

--

- Lesson: don't try a complicated dimension reduction technique when a simple one is better.

---

# PCA

<img src="04KPCA_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" />

---

# Kernel PCA

<img src="04KPCA_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" />

---

# Conclusion

- Kernel PCA has the advantage of being a non-linear dimension reduction technique.

--

- Some disadvantages include:

--

  + Need to choose a kernel.
  + Need to choose tuning parameters for the kernel.
  + Eigendecomposition is slow for large `\(n\)`, although this can be mitigated by using a kernel that imposes sparsity.

---
class: center, inverse, middle

# Questions?
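
---

# Appendix: trying the polynomial kernel

- The polynomial kernel from earlier can also be passed to `embed()`. The kernel name `'polydot'` and its parameter names below follow the kernlab package that dimRed builds on, so treat them as assumptions to check against the documentation.


```r
# Assumes wb from the earlier code chunk; 'polydot' with degree = 2, scale = 1
# and offset = 1 corresponds to K(x, z) = (<x, z> + 1)^2
kpcaout_poly <- embed(.data = wb, .method = "kPCA",
                      kernel = 'polydot',
                      kpar = list(degree = 2, scale = 1, offset = 1))
```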