
Dimension Reduction

High Dimensional Data Analysis

Anastasios Panagiotelis & Ruben Loaiza-Maya

Lecture 10

1

A proof

2

How to do PCA

  • Over the next few slides, we will derive how to get the first principal component.
  • This eventually leads us to the eigenvalue decomposition.
  • The eigenvalue decomposition and its more general form, known as the singular value decomposition, are crucial to all dimension reduction techniques.
  • This will be challenging.
3

First PC

  • Recall that the first principal component is
    • The linear combination of the variables
    • With maximum variance
    • Subject to the squared weights summing to 1.
4

Linear Combination

  • Let $c_i = w_1y_{i1} + w_2y_{i2} + \dots + w_py_{ip}$
  • In matrix/vector form $c_i = w'y_i$
  • What are the dimensions of $c_i$, $w$ and $y_i$?
  • Both $w$ and $y_i$ are $p \times 1$ vectors, while $c_i$ is a scalar.
5

Variance

  • The variance of the linear combination is given by

$$\text{Var}(c) = \frac{\sum_{i=1}^{n} c_i^2}{n-1} = \frac{\sum_{i=1}^{n} (w'y_i)^2}{n-1}$$

  • We have assumed that all $y$'s have a mean of zero.
  • Everything still works without this assumption but is messier.
6

A trick

  • Note that $w_1y_{i1} + \dots + w_py_{ip}$ has been written as $w'y_i$, but...
  • ... it can also be written as $y_i'w$.
  • This in turn implies that $c_i^2 = w'y_iy_i'w$. Substituting into the variance formula gives
    $$\text{Var}(c) = \frac{\sum_{i=1}^{n} c_i^2}{n-1} = \frac{\sum_{i=1}^{n} w'y_iy_i'w}{n-1}$$
7

Linearity

Linearity implies that anything without an $i$ subscript can be taken outside the summation sign.

$$\text{Var}(c) = \frac{\sum_{i=1}^{n} c_i^2}{n-1} = \frac{w'\left(\sum_{i=1}^{n} y_iy_i'\right)w}{n-1}$$

8

Scalar multiplication

The order of scalar multiplication does not matter, allowing the following

$$\text{Var}(c) = w'\left(\frac{\sum_{i=1}^{n} y_iy_i'}{n-1}\right)w = w'Sw$$

Recall that $S$ is the variance-covariance matrix (a numerical check follows below).

9
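
A quick numerical check of this identity (a minimal R sketch, not part of the original slides; the USArrests data is used only as a convenient example):

```r
set.seed(1)
Y <- scale(USArrests, center = TRUE, scale = FALSE)  # mean-zero data matrix
p <- ncol(Y)
w <- rnorm(p); w <- w / sqrt(sum(w^2))               # weights with squared weights summing to 1
c_scores <- as.vector(Y %*% w)                       # the linear combination for each observation
S <- cov(Y)                                          # variance-covariance matrix
var(c_scores)                                        # sample variance of the linear combination
as.numeric(t(w) %*% S %*% w)                         # w'Sw gives the same number
```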

Objective

We want to choose $w$ to maximise the variance while ensuring that $w_1^2 + w_2^2 + \dots + w_p^2 = w'w = 1$. We write this as

$$\max_{w} \; w'Sw \quad \text{s.t.} \quad w'w = 1$$

10

Optimisation

11

Constrained Optimisation

Solving the constrained optimisation above is equivalent to solving the following unconstrained problem

$$\max_{w,\lambda} \; w'Sw - \lambda(w'w - 1)$$

12

Gradient

  • Many optimisation problems involve using the gradient or slope of an objective function.
  • Think of an analogy of climbing a hill.
  • If the gradient is positive then you can go higher by walking forwards.
  • If the gradient is negative then you can go higher by walking backwards.
  • The top of the hill is where the gradient is zero.
13

Gradient

  • To compute the gradient we need to use matrix calculus.
  • What does it mean to differentiate with respect to w?
  • It means we differentiate with respect to $w_1$, $w_2$, etc.
  • All up $p$ first derivatives are found. These can be stored in a vector.
14

First Order Conditions

Differentiating w.r.t. $w$ gives

$$\frac{\partial\left(w'Sw - \lambda(w'w-1)\right)}{\partial w} = 2Sw - 2\lambda w$$

Differentiating w.r.t. $\lambda$ gives

$$\frac{\partial\left(w'Sw - \lambda(w'w-1)\right)}{\partial \lambda} = -(w'w - 1)$$

15

How did we do that?

The key result is that for any square, symmetric matrix $A$ it holds that

$$\frac{\partial\, w'Aw}{\partial w} = 2Aw$$

This is the matrix version of the rule that the derivative of $aw^2$ with respect to $w$ is $2aw$. From the scalar rule, the matrix result can be derived (but this is tedious). A numerical check is sketched below.

16
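
As a sanity check, the matrix derivative rule can be verified with finite differences (a hedged R sketch, not from the slides; the matrix and weights are made up for illustration):

```r
set.seed(2)
p <- 3
A <- crossprod(matrix(rnorm(p * p), p, p))     # an arbitrary symmetric matrix
w <- rnorm(p)
f <- function(w) as.numeric(t(w) %*% A %*% w)  # the quadratic form w'Aw
eps <- 1e-6
numeric_grad <- sapply(seq_len(p), function(j) {
  e <- rep(0, p); e[j] <- eps
  (f(w + e) - f(w - e)) / (2 * eps)            # central finite difference
})
cbind(numeric = numeric_grad, analytic = as.vector(2 * A %*% w))  # the two columns agree
```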

Eigenvalue Problem

The gradient will be zero when $2Sw - 2\lambda w = 0$, or, simplifying, when

$$Sw = \lambda w$$

This is a very famous problem known as the eigenvalue problem. Suppose $\tilde{\lambda}$ and $\tilde{w}$ provide a solution; then

  • The value of $\tilde{\lambda}$ is called an eigenvalue
  • The vector $\tilde{w}$ is called an eigenvector
17

Eigenvalue Problem

  • For $2\times 2$, $3\times 3$ and $4\times 4$ matrices there are formulas for $\tilde{\lambda}$.
  • These are hideous.
  • For $5\times 5$ and beyond there is no formula.
  • A solution is found using numerical methods (i.e. a computer algorithm), as in the sketch below.
18
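
A minimal R sketch of the numerical solution (the built-in USArrests data is used purely as an example; eigen() is the standard numerical routine):

```r
S <- cov(USArrests)        # a 4 x 4 sample covariance matrix
e <- eigen(S)
e$values                   # eigenvalues, returned from largest to smallest
e$vectors[, 1]             # eigenvector paired with the largest eigenvalue
S %*% e$vectors[, 1] - e$values[1] * e$vectors[, 1]   # Sw - lambda*w: numerically zero
```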

Geometric View

  • Recall that multiplying by a matrix moves vectors around, changing their length and direction.
  • However, for any matrix there will be some vector whose direction does not change; only its length does.
  • This vector is an eigenvector.
  • The extent to which the length is changed is the eigenvalue.
19

Multiple solutions

  • In general there are multiple pairs of $(\tilde{\lambda}, \tilde{w})$ that solve the eigenvalue problem.
  • Which one maximises the variance?
  • Let $\tilde{w}$ be an eigenvector and its associated eigenvalue be $\tilde{\lambda}$.
  • What is the variance of the linear combination $\tilde{w}'y$?
20

Answer

We have already shown that the variance will be $\tilde{w}'S\tilde{w}$. Since $\tilde{w}$ is an eigenvector it must hold that

$$S\tilde{w} = \tilde{\lambda}\tilde{w}$$

which implies

$$\tilde{w}'S\tilde{w} = \tilde{w}'\tilde{\lambda}\tilde{w} = \tilde{\lambda}\tilde{w}'\tilde{w}$$

21

Variance

  • Since $\tilde{w}'\tilde{w} = 1$ this implies that the variance of the linear combination is $\tilde{\lambda}$.
  • The weights for the first principal component are given by the eigenvector that corresponds to the largest eigenvalue.
  • The weights of the remaining principal components are given by the other eigenvectors (see the check below).
22
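
The claim can be checked against R's prcomp() (a sketch on the USArrests data, which is only an illustrative choice; eigenvectors are identified only up to sign, so a sign flip between the two columns is possible):

```r
Y <- scale(USArrests, center = TRUE, scale = FALSE)   # mean-zero data
e <- eigen(cov(Y))
pca <- prcomp(Y)
cbind(eigen = e$vectors[, 1], prcomp = pca$rotation[, 1])          # same weights up to sign
c(largest_eigenvalue = e$values[1], pc1_variance = pca$sdev[1]^2)  # same variance
```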

Matrix Decompositions

23

Spectral Theorem

Since $S$ is a symmetric matrix it can be decomposed as

$$\underset{(p\times p)}{S} = \underset{(p\times p)}{W}\;\underset{(p\times p)}{\Lambda}\;\underset{(p\times p)}{W'}$$

  • The columns of $W$ are eigenvectors of $S$
  • $\Lambda$ is a matrix with the eigenvalues along the main diagonal and zeros on the off-diagonal.
  • The eigenvalues and eigenvectors can be rearranged, so by convention the eigenvalues in $\Lambda$ are sorted from largest to smallest.
24
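
A one-line numerical check of the decomposition (an R sketch, again using USArrests purely as an example):

```r
S <- cov(USArrests)
e <- eigen(S)
W <- e$vectors
Lambda <- diag(e$values)
max(abs(S - W %*% Lambda %*% t(W)))   # essentially zero, up to floating-point error
```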

Rotation

  • The full vector of principal components for observation $i$ is given by $c_i = W'y_i$
  • The eigenvectors of a symmetric matrix are also orthogonal (a proof of why this is true can be provided for anyone who is curious).
  • Orthogonality implies that the matrix of eigenvectors $W$ is a rotation matrix.
  • For this reason we consider PCA to be a rotation of the data.
25
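
Orthogonality is easy to confirm numerically (an R sketch; the example data is an assumption):

```r
W <- eigen(cov(USArrests))$vectors
round(t(W) %*% W, 10)   # W'W is the identity matrix, so W acts as a rotation
```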

PCA as an approximation

It can be shown that an equivalent way of writing the eigenvalue decomposition is

$$\underset{(p\times p)}{S} = \underset{(p\times p)}{W}\;\underset{(p\times p)}{\Lambda}\;\underset{(p\times p)}{W'} = \sum_{j=1}^{p} \underset{(1\times 1)}{\lambda_j}\;\underset{(p\times 1)}{w_j}\;\underset{(1\times p)}{w_j'}$$

26

PCA as an approximation

If some eigenvalues are small they can be ignored.

$$S = \sum_{j=1}^{p} \lambda_j w_j w_j' \approx \sum_{j=1}^{r} \lambda_j w_j w_j'$$

Only $r \ll p$ eigenvalues are used.

27
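
In R, a rank-$r$ approximation of $S$ can be built directly from the eigen decomposition (a sketch; the choice of data and of $r = 2$ are illustrative assumptions):

```r
e <- eigen(cov(USArrests))
r <- 2
S_approx <- e$vectors[, 1:r] %*% diag(e$values[1:r]) %*% t(e$vectors[, 1:r])
max(abs(cov(USArrests) - S_approx))   # error from discarding the small eigenvalues
```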

Decomposition

  • Consider a 50×50 covariance matrix.
  • There are 1275 variances and covariances to estimate
  • Suppose the data can be summarised by just 5 factors/principal components.
  • Then the matrix can be approximated with just 5 eigenvalues and eigenvectors (255 numbers).
28

In General

  • For a matrix $Y$ that is possibly non-symmetric and possibly non-square, a similar decomposition, known as the singular value decomposition, can be used.

$$\underset{(n\times p)}{Y} = \underset{(n\times n)}{U}\;\underset{(n\times p)}{D}\;\underset{(p\times p)}{V'}$$

The matrices $U$ and $V$ are rotations.

29
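
A minimal sketch of the SVD in R (svd() returns the compact form by default, with $U$ of size $n \times \min(n,p)$; full matrices can be requested with the nu and nv arguments):

```r
Y <- scale(USArrests, center = TRUE, scale = FALSE)   # 50 x 4 centred data matrix
s <- svd(Y)
dim(s$u); length(s$d); dim(s$v)                       # compact U, singular values, V
max(abs(Y - s$u %*% diag(s$d) %*% t(s$v)))            # reconstruction error ~ 0
```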

Structure of D

  • If $n > p$
    $$D = \begin{bmatrix} d_1 & 0 & \cdots & 0 \\ 0 & d_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & d_p \\ 0 & 0 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & 0 \end{bmatrix}$$
30

Structure of D

  • If $n < p$
    $$D = \begin{bmatrix} d_1 & 0 & \cdots & 0 & 0 & \cdots & 0 \\ 0 & d_2 & \cdots & 0 & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & d_n & 0 & \cdots & 0 \end{bmatrix}$$
  • In both cases all $d_i > 0$
  • These are called singular values.
31

Singular Values

  • The singular values are ordered from largest to smallest, allowing for an approximation

$$Y = \sum_{j=1}^{\min(n,p)} d_j u_j v_j' \approx \sum_{j=1}^{r} d_j u_j v_j'$$

for $r \ll \min(n,p)$.

32
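
The same truncation in R (a sketch; $r = 2$ and the example data are assumptions for illustration):

```r
Y <- scale(USArrests, center = TRUE, scale = FALSE)
s <- svd(Y)
r <- 2
Y_r <- s$u[, 1:r] %*% diag(s$d[1:r]) %*% t(s$v[, 1:r])   # keep the r largest singular values
max(abs(Y - Y_r))                                        # error from the discarded ones
```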

Biplots and the SVD

  • When $Y$ is the data matrix there is a connection between the biplot and the SVD.
  • For the distance biplot, the first two columns of $UD$ are plotted as points and the first two columns of $V$ as arrows.
  • For the correlation biplot, the first two columns of $U$ are plotted as points and the first two columns of $VD$ as arrows.
  • In general, plot the first two columns of $UD^{\kappa}$ as points and the first two columns of $VD^{(1-\kappa)}$ as arrows.
  • In R, $\kappa$ is set by the scale option of biplot (a hand-rolled example follows below).
33
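
A hand-rolled distance biplot ($\kappa = 1$) built directly from the SVD (a sketch; the data choice and the arrow rescaling factor are assumptions made purely so the arrows are visible):

```r
Y <- scale(USArrests, center = TRUE, scale = FALSE)
s <- svd(Y)
pts <- (s$u %*% diag(s$d))[, 1:2]     # observations: first two columns of UD
arr <- s$v[, 1:2]                     # variables: first two columns of V
plot(pts, xlab = "PC1", ylab = "PC2")
arrows(0, 0, arr[, 1] * 50, arr[, 2] * 50, col = "red")   # rescaled for visibility
text(arr * 55, labels = colnames(Y), col = "red")
```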

A final example

34

A picture

35

Pixels

  • For the computer this picture is a matrix
  • Each pixel on the screen has a number between 0 and 1.
    • Numbers closer to 0 display as lighter shades of grey
    • Numbers closer to 1 display as darker shades of grey
  • What if we do the SVD on this matrix?
36

SVD

  • All up there are $232 \times 218 = 50576$ pixels.
  • Suppose we approximate this matrix with 20 singular values.
  • Then $U^{(r)}$ has $232 \times 20 = 4640$ numbers.
  • Then $V^{(r)}$ has $218 \times 20 = 4360$ numbers.
  • Including the 20 singular values themselves, we summarise 50576 numbers using only $4640 + 4360 + 20 = 9020$ numbers.
37
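
A sketch of this rank-20 approximation in R, assuming the greyscale picture has already been read into a 232 x 218 numeric matrix called img with values between 0 and 1 (the object name and how it was imported are assumptions, not part of the slides):

```r
s <- svd(img)
r <- 20
img_r    <- s$u[, 1:r]    %*% diag(s$d[1:r])    %*% t(s$v[, 1:r])     # first 20 singular values
img_rest <- s$u[, -(1:r)] %*% diag(s$d[-(1:r)]) %*% t(s$v[, -(1:r)])  # singular values 21 to 218
greys <- grey(seq(1, 0, length.out = 256))
par(mfrow = c(1, 2))
image(img_r, col = greys, axes = FALSE)     # orientation may need flipping,
image(img_rest, col = greys, axes = FALSE)  # depending on how the image was read
```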

Approximation

38

Discussion

  • Using only 20 singular values we do not lose much information.
  • What if we reconstruct the picture using singular value 21 to singular value 218?
  • This uses a lot more information. Does it give a clearer approximation?
39

Using remaining singular values

40

Singular values

41

Conclusion

  • The main idea is that the SVD summarises the important information in the matrix into a small number of singular values.
  • Rotating so that we can isolate the dimensions associated with those singular values is the geometry behind dimension reduction.
  • This applies to PCA, factor analysis and MDS as well as to compressing images.
42
