class: center, middle, inverse, title-slide

# Principal Components Analysis
## High Dimensional Data Analysis
### Anastasios Panagiotelis & Ruben Loaiza-Maya
### Lecture 6

---
class: inverse, center, middle

# Motivation

---

# High Dimensional Data

- In **marketing** surveys we may ask a large number of questions about customer experience.<!--D-->

--

- In **finance** there may be several ways to assess the creditworthiness of firms.<!--D-->

--

- In **economics** the development of a country or state can be measured in different ways.

---

# A real example

- Consider a dataset with the following variables for the 50 States of the USA<!--D-->

--

  + Income
  + Illiteracy
  + Life Expectancy
  + Murder Rate
  + High School Graduation Rate<!--D-->

--

- You can access this via Moodle from the file *StateSE.rds*<!--D-->

---

# Summarising many variables

- Often we aim to combine many variables into a single index<!--D-->

--

  + In finance a credit score summarises all the information about the likelihood of bankruptcy for a company.<!--D-->

--

  + In marketing we require a single overall measure of customer experience.<!--D-->

--

  + In economics the Human Development Index is a single measure that takes income, education and health into account.

---

# Weighted linear combination

- A convenient way to combine variables is through a *linear combination* (LC)<!--D-->

--

  + For example, your grade for this unit:
$$ w_1\mbox{Assign. Marks}+w_2\mbox{Exam Mark} $$
  + Here `\(w_1\)` and `\(w_2\)` are called *weights*
  + In this unit, the weight for the Assignments is *50%* and for the Examination is *50%*<!--D-->

--

- What is a good way to choose weights?
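
As a small illustration of a weighted linear combination, the sketch below computes unit grades in R from hypothetical marks. The data frame `marks`, its column names and the three students are made up for illustration; only the 50%/50% weights come from the slide.

```r
# Hypothetical marks for three students (illustrative values, not real data)
marks <- data.frame(
  assignment = c(80, 65, 90),
  exam       = c(70, 75, 50)
)

# Weights from the slide: 50% for the assignments, 50% for the exam
w <- c(assignment = 0.5, exam = 0.5)

# The linear combination w_1 * assignment + w_2 * exam, one value per student
grade <- drop(as.matrix(marks) %*% w)
grade
```

Changing the entries of `w` changes how strongly each component drives the final grade.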
---

# Maximise variance

- The purpose of grading students is to differentiate the best performing students from the weakest performing students <!--D-->

--

- The index should have *large variance*.<!--D-->

--

- The LC with the highest variance is the **first Principal Component** of the data.<!--D-->

--

- The first principal component is a new variable that *explains* as much variance as possible in the original variables.

---

# Original Data

<div style="border: 1px solid #ddd; padding: 0px; overflow-y: scroll; height:500px; "><table class="table table-striped table-hover table-condensed" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;position: sticky; top:0; background-color: #FFFFFF;"> State </th> <th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> Income </th> <th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> Illiteracy </th> <th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> LifeExp </th> <th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> Murder </th> <th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> HSGrad </th> <th style="text-align:left;position: sticky; top:0; background-color: #FFFFFF;"> StateAbb </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Alabama </td> <td style="text-align:right;"> 3624 </td> <td style="text-align:right;"> 2.1 </td> <td style="text-align:right;"> 69.05 </td> <td style="text-align:right;"> 15.1 </td> <td style="text-align:right;"> 41.3 </td> <td style="text-align:left;"> AL </td> </tr> <tr> <td style="text-align:left;"> Alaska </td> <td style="text-align:right;"> 6315 </td> <td style="text-align:right;"> 1.5 </td> <td style="text-align:right;"> 69.31 </td> <td style="text-align:right;"> 11.3 </td> <td style="text-align:right;"> 66.7 </td> <td style="text-align:left;"> AK </td> </tr> <tr> <td style="text-align:left;"> Arizona
</td> <td style="text-align:right;"> 4530 </td> <td style="text-align:right;"> 1.8 </td> <td style="text-align:right;"> 70.55 </td> <td style="text-align:right;"> 7.8 </td> <td style="text-align:right;"> 58.1 </td> <td style="text-align:left;"> AZ </td> </tr> <tr> <td style="text-align:left;"> Arkansas </td> <td style="text-align:right;"> 3378 </td> <td style="text-align:right;"> 1.9 </td> <td style="text-align:right;"> 70.66 </td> <td style="text-align:right;"> 10.1 </td> <td style="text-align:right;"> 39.9 </td> <td style="text-align:left;"> AR </td> </tr> <tr> <td style="text-align:left;"> California </td> <td style="text-align:right;"> 5114 </td> <td style="text-align:right;"> 1.1 </td> <td style="text-align:right;"> 71.71 </td> <td style="text-align:right;"> 10.3 </td> <td style="text-align:right;"> 62.6 </td> <td style="text-align:left;"> CA </td> </tr> <tr> <td style="text-align:left;"> Colorado </td> <td style="text-align:right;"> 4884 </td> <td style="text-align:right;"> 0.7 </td> <td style="text-align:right;"> 72.06 </td> <td style="text-align:right;"> 6.8 </td> <td style="text-align:right;"> 63.9 </td> <td style="text-align:left;"> CO </td> </tr> <tr> <td style="text-align:left;"> Connecticut </td> <td style="text-align:right;"> 5348 </td> <td style="text-align:right;"> 1.1 </td> <td style="text-align:right;"> 72.48 </td> <td style="text-align:right;"> 3.1 </td> <td style="text-align:right;"> 56.0 </td> <td style="text-align:left;"> CT </td> </tr> <tr> <td style="text-align:left;"> Delaware </td> <td style="text-align:right;"> 4809 </td> <td style="text-align:right;"> 0.9 </td> <td style="text-align:right;"> 70.06 </td> <td style="text-align:right;"> 6.2 </td> <td style="text-align:right;"> 54.6 </td> <td style="text-align:left;"> DE </td> </tr> <tr> <td style="text-align:left;"> Florida </td> <td style="text-align:right;"> 4815 </td> <td style="text-align:right;"> 1.3 </td> <td style="text-align:right;"> 70.66 </td> <td style="text-align:right;"> 10.7 
</td> <td style="text-align:right;"> 52.6 </td> <td style="text-align:left;"> FL </td> </tr> <tr> <td style="text-align:left;"> Georgia </td> <td style="text-align:right;"> 4091 </td> <td style="text-align:right;"> 2.0 </td> <td style="text-align:right;"> 68.54 </td> <td style="text-align:right;"> 13.9 </td> <td style="text-align:right;"> 40.6 </td> <td style="text-align:left;"> GA </td> </tr> <tr> <td style="text-align:left;"> Hawaii </td> <td style="text-align:right;"> 4963 </td> <td style="text-align:right;"> 1.9 </td> <td style="text-align:right;"> 73.60 </td> <td style="text-align:right;"> 6.2 </td> <td style="text-align:right;"> 61.9 </td> <td style="text-align:left;"> HI </td> </tr> <tr> <td style="text-align:left;"> Idaho </td> <td style="text-align:right;"> 4119 </td> <td style="text-align:right;"> 0.6 </td> <td style="text-align:right;"> 71.87 </td> <td style="text-align:right;"> 5.3 </td> <td style="text-align:right;"> 59.5 </td> <td style="text-align:left;"> ID </td> </tr> <tr> <td style="text-align:left;"> Illinois </td> <td style="text-align:right;"> 5107 </td> <td style="text-align:right;"> 0.9 </td> <td style="text-align:right;"> 70.14 </td> <td style="text-align:right;"> 10.3 </td> <td style="text-align:right;"> 52.6 </td> <td style="text-align:left;"> IL </td> </tr> <tr> <td style="text-align:left;"> Indiana </td> <td style="text-align:right;"> 4458 </td> <td style="text-align:right;"> 0.7 </td> <td style="text-align:right;"> 70.88 </td> <td style="text-align:right;"> 7.1 </td> <td style="text-align:right;"> 52.9 </td> <td style="text-align:left;"> IN </td> </tr> <tr> <td style="text-align:left;"> Iowa </td> <td style="text-align:right;"> 4628 </td> <td style="text-align:right;"> 0.5 </td> <td style="text-align:right;"> 72.56 </td> <td style="text-align:right;"> 2.3 </td> <td style="text-align:right;"> 59.0 </td> <td style="text-align:left;"> IA </td> </tr> <tr> <td style="text-align:left;"> Kansas </td> <td style="text-align:right;"> 4669 </td> 
<td style="text-align:right;"> 0.6 </td> <td style="text-align:right;"> 72.58 </td> <td style="text-align:right;"> 4.5 </td> <td style="text-align:right;"> 59.9 </td> <td style="text-align:left;"> KS </td> </tr> <tr> <td style="text-align:left;"> Kentucky </td> <td style="text-align:right;"> 3712 </td> <td style="text-align:right;"> 1.6 </td> <td style="text-align:right;"> 70.10 </td> <td style="text-align:right;"> 10.6 </td> <td style="text-align:right;"> 38.5 </td> <td style="text-align:left;"> KY </td> </tr> <tr> <td style="text-align:left;"> Louisiana </td> <td style="text-align:right;"> 3545 </td> <td style="text-align:right;"> 2.8 </td> <td style="text-align:right;"> 68.76 </td> <td style="text-align:right;"> 13.2 </td> <td style="text-align:right;"> 42.2 </td> <td style="text-align:left;"> LA </td> </tr> <tr> <td style="text-align:left;"> Maine </td> <td style="text-align:right;"> 3694 </td> <td style="text-align:right;"> 0.7 </td> <td style="text-align:right;"> 70.39 </td> <td style="text-align:right;"> 2.7 </td> <td style="text-align:right;"> 54.7 </td> <td style="text-align:left;"> ME </td> </tr> <tr> <td style="text-align:left;"> Maryland </td> <td style="text-align:right;"> 5299 </td> <td style="text-align:right;"> 0.9 </td> <td style="text-align:right;"> 70.22 </td> <td style="text-align:right;"> 8.5 </td> <td style="text-align:right;"> 52.3 </td> <td style="text-align:left;"> MD </td> </tr> <tr> <td style="text-align:left;"> Massachusetts </td> <td style="text-align:right;"> 4755 </td> <td style="text-align:right;"> 1.1 </td> <td style="text-align:right;"> 71.83 </td> <td style="text-align:right;"> 3.3 </td> <td style="text-align:right;"> 58.5 </td> <td style="text-align:left;"> MA </td> </tr> <tr> <td style="text-align:left;"> Michigan </td> <td style="text-align:right;"> 4751 </td> <td style="text-align:right;"> 0.9 </td> <td style="text-align:right;"> 70.63 </td> <td style="text-align:right;"> 11.1 </td> <td style="text-align:right;"> 52.8 </td> 
<td style="text-align:left;"> MI </td> </tr> <tr> <td style="text-align:left;"> Minnesota </td> <td style="text-align:right;"> 4675 </td> <td style="text-align:right;"> 0.6 </td> <td style="text-align:right;"> 72.96 </td> <td style="text-align:right;"> 2.3 </td> <td style="text-align:right;"> 57.6 </td> <td style="text-align:left;"> MN </td> </tr> <tr> <td style="text-align:left;"> Mississippi </td> <td style="text-align:right;"> 3098 </td> <td style="text-align:right;"> 2.4 </td> <td style="text-align:right;"> 68.09 </td> <td style="text-align:right;"> 12.5 </td> <td style="text-align:right;"> 41.0 </td> <td style="text-align:left;"> MS </td> </tr> <tr> <td style="text-align:left;"> Missouri </td> <td style="text-align:right;"> 4254 </td> <td style="text-align:right;"> 0.8 </td> <td style="text-align:right;"> 70.69 </td> <td style="text-align:right;"> 9.3 </td> <td style="text-align:right;"> 48.8 </td> <td style="text-align:left;"> MO </td> </tr> <tr> <td style="text-align:left;"> Montana </td> <td style="text-align:right;"> 4347 </td> <td style="text-align:right;"> 0.6 </td> <td style="text-align:right;"> 70.56 </td> <td style="text-align:right;"> 5.0 </td> <td style="text-align:right;"> 59.2 </td> <td style="text-align:left;"> MT </td> </tr> <tr> <td style="text-align:left;"> Nebraska </td> <td style="text-align:right;"> 4508 </td> <td style="text-align:right;"> 0.6 </td> <td style="text-align:right;"> 72.60 </td> <td style="text-align:right;"> 2.9 </td> <td style="text-align:right;"> 59.3 </td> <td style="text-align:left;"> NE </td> </tr> <tr> <td style="text-align:left;"> Nevada </td> <td style="text-align:right;"> 5149 </td> <td style="text-align:right;"> 0.5 </td> <td style="text-align:right;"> 69.03 </td> <td style="text-align:right;"> 11.5 </td> <td style="text-align:right;"> 65.2 </td> <td style="text-align:left;"> NV </td> </tr> <tr> <td style="text-align:left;"> New Hampshire </td> <td style="text-align:right;"> 4281 </td> <td style="text-align:right;"> 
0.7 </td> <td style="text-align:right;"> 71.23 </td> <td style="text-align:right;"> 3.3 </td> <td style="text-align:right;"> 57.6 </td> <td style="text-align:left;"> NH </td> </tr> <tr> <td style="text-align:left;"> New Jersey </td> <td style="text-align:right;"> 5237 </td> <td style="text-align:right;"> 1.1 </td> <td style="text-align:right;"> 70.93 </td> <td style="text-align:right;"> 5.2 </td> <td style="text-align:right;"> 52.5 </td> <td style="text-align:left;"> NJ </td> </tr> <tr> <td style="text-align:left;"> New Mexico </td> <td style="text-align:right;"> 3601 </td> <td style="text-align:right;"> 2.2 </td> <td style="text-align:right;"> 70.32 </td> <td style="text-align:right;"> 9.7 </td> <td style="text-align:right;"> 55.2 </td> <td style="text-align:left;"> NM </td> </tr> <tr> <td style="text-align:left;"> New York </td> <td style="text-align:right;"> 4903 </td> <td style="text-align:right;"> 1.4 </td> <td style="text-align:right;"> 70.55 </td> <td style="text-align:right;"> 10.9 </td> <td style="text-align:right;"> 52.7 </td> <td style="text-align:left;"> NY </td> </tr> <tr> <td style="text-align:left;"> North Carolina </td> <td style="text-align:right;"> 3875 </td> <td style="text-align:right;"> 1.8 </td> <td style="text-align:right;"> 69.21 </td> <td style="text-align:right;"> 11.1 </td> <td style="text-align:right;"> 38.5 </td> <td style="text-align:left;"> NC </td> </tr> <tr> <td style="text-align:left;"> North Dakota </td> <td style="text-align:right;"> 5087 </td> <td style="text-align:right;"> 0.8 </td> <td style="text-align:right;"> 72.78 </td> <td style="text-align:right;"> 1.4 </td> <td style="text-align:right;"> 50.3 </td> <td style="text-align:left;"> ND </td> </tr> <tr> <td style="text-align:left;"> Ohio </td> <td style="text-align:right;"> 4561 </td> <td style="text-align:right;"> 0.8 </td> <td style="text-align:right;"> 70.82 </td> <td style="text-align:right;"> 7.4 </td> <td style="text-align:right;"> 53.2 </td> <td 
style="text-align:left;"> OH </td> </tr> <tr> <td style="text-align:left;"> Oklahoma </td> <td style="text-align:right;"> 3983 </td> <td style="text-align:right;"> 1.1 </td> <td style="text-align:right;"> 71.42 </td> <td style="text-align:right;"> 6.4 </td> <td style="text-align:right;"> 51.6 </td> <td style="text-align:left;"> OK </td> </tr> <tr> <td style="text-align:left;"> Oregon </td> <td style="text-align:right;"> 4660 </td> <td style="text-align:right;"> 0.6 </td> <td style="text-align:right;"> 72.13 </td> <td style="text-align:right;"> 4.2 </td> <td style="text-align:right;"> 60.0 </td> <td style="text-align:left;"> OR </td> </tr> <tr> <td style="text-align:left;"> Pennsylvania </td> <td style="text-align:right;"> 4449 </td> <td style="text-align:right;"> 1.0 </td> <td style="text-align:right;"> 70.43 </td> <td style="text-align:right;"> 6.1 </td> <td style="text-align:right;"> 50.2 </td> <td style="text-align:left;"> PA </td> </tr> <tr> <td style="text-align:left;"> Rhode Island </td> <td style="text-align:right;"> 4558 </td> <td style="text-align:right;"> 1.3 </td> <td style="text-align:right;"> 71.90 </td> <td style="text-align:right;"> 2.4 </td> <td style="text-align:right;"> 46.4 </td> <td style="text-align:left;"> RI </td> </tr> <tr> <td style="text-align:left;"> South Carolina </td> <td style="text-align:right;"> 3635 </td> <td style="text-align:right;"> 2.3 </td> <td style="text-align:right;"> 67.96 </td> <td style="text-align:right;"> 11.6 </td> <td style="text-align:right;"> 37.8 </td> <td style="text-align:left;"> SC </td> </tr> <tr> <td style="text-align:left;"> South Dakota </td> <td style="text-align:right;"> 4167 </td> <td style="text-align:right;"> 0.5 </td> <td style="text-align:right;"> 72.08 </td> <td style="text-align:right;"> 1.7 </td> <td style="text-align:right;"> 53.3 </td> <td style="text-align:left;"> SD </td> </tr> <tr> <td style="text-align:left;"> Tennessee </td> <td style="text-align:right;"> 3821 </td> <td 
style="text-align:right;"> 1.7 </td> <td style="text-align:right;"> 70.11 </td> <td style="text-align:right;"> 11.0 </td> <td style="text-align:right;"> 41.8 </td> <td style="text-align:left;"> TN </td> </tr> <tr> <td style="text-align:left;"> Texas </td> <td style="text-align:right;"> 4188 </td> <td style="text-align:right;"> 2.2 </td> <td style="text-align:right;"> 70.90 </td> <td style="text-align:right;"> 12.2 </td> <td style="text-align:right;"> 47.4 </td> <td style="text-align:left;"> TX </td> </tr> <tr> <td style="text-align:left;"> Utah </td> <td style="text-align:right;"> 4022 </td> <td style="text-align:right;"> 0.6 </td> <td style="text-align:right;"> 72.90 </td> <td style="text-align:right;"> 4.5 </td> <td style="text-align:right;"> 67.3 </td> <td style="text-align:left;"> UT </td> </tr> <tr> <td style="text-align:left;"> Vermont </td> <td style="text-align:right;"> 3907 </td> <td style="text-align:right;"> 0.6 </td> <td style="text-align:right;"> 71.64 </td> <td style="text-align:right;"> 5.5 </td> <td style="text-align:right;"> 57.1 </td> <td style="text-align:left;"> VT </td> </tr> <tr> <td style="text-align:left;"> Virginia </td> <td style="text-align:right;"> 4701 </td> <td style="text-align:right;"> 1.4 </td> <td style="text-align:right;"> 70.08 </td> <td style="text-align:right;"> 9.5 </td> <td style="text-align:right;"> 47.8 </td> <td style="text-align:left;"> VA </td> </tr> <tr> <td style="text-align:left;"> Washington </td> <td style="text-align:right;"> 4864 </td> <td style="text-align:right;"> 0.6 </td> <td style="text-align:right;"> 71.72 </td> <td style="text-align:right;"> 4.3 </td> <td style="text-align:right;"> 63.5 </td> <td style="text-align:left;"> WA </td> </tr> <tr> <td style="text-align:left;"> West Virginia </td> <td style="text-align:right;"> 3617 </td> <td style="text-align:right;"> 1.4 </td> <td style="text-align:right;"> 69.48 </td> <td style="text-align:right;"> 6.7 </td> <td style="text-align:right;"> 41.6 </td> <td 
style="text-align:left;"> WV </td> </tr> <tr> <td style="text-align:left;"> Wisconsin </td> <td style="text-align:right;"> 4468 </td> <td style="text-align:right;"> 0.7 </td> <td style="text-align:right;"> 72.48 </td> <td style="text-align:right;"> 3.0 </td> <td style="text-align:right;"> 54.5 </td> <td style="text-align:left;"> WI </td> </tr> <tr> <td style="text-align:left;"> Wyoming </td> <td style="text-align:right;"> 4566 </td> <td style="text-align:right;"> 0.6 </td> <td style="text-align:right;"> 70.29 </td> <td style="text-align:right;"> 6.9 </td> <td style="text-align:right;"> 62.9 </td> <td style="text-align:left;"> WY </td> </tr> </tbody> </table></div> --- # First PC <div style="border: 1px solid #ddd; padding: 0px; overflow-y: scroll; height:500px; "><table class="table table-striped table-hover table-condensed" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;position: sticky; top:0; background-color: #FFFFFF;"> State </th> <th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> .fittedPC1 </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Alabama </td> <td style="text-align:right;"> -3.4736429 </td> </tr> <tr> <td style="text-align:left;"> Alaska </td> <td style="text-align:right;"> 0.5523458 </td> </tr> <tr> <td style="text-align:left;"> Arizona </td> <td style="text-align:right;"> -0.3218179 </td> </tr> <tr> <td style="text-align:left;"> Arkansas </td> <td style="text-align:right;"> -2.3518240 </td> </tr> <tr> <td style="text-align:left;"> California </td> <td style="text-align:right;"> 0.9138319 </td> </tr> <tr> <td style="text-align:left;"> Colorado </td> <td style="text-align:right;"> 1.7319349 </td> </tr> <tr> <td style="text-align:left;"> Connecticut </td> <td style="text-align:right;"> 1.8293070 </td> </tr> <tr> <td style="text-align:left;"> Delaware </td> <td style="text-align:right;"> 0.3708443 </td> </tr> <tr> <td style="text-align:left;"> Florida </td> <td 
style="text-align:right;"> -0.4071974 </td> </tr> <tr> <td style="text-align:left;"> Georgia </td> <td style="text-align:right;"> -3.2000232 </td> </tr> <tr> <td style="text-align:left;"> Hawaii </td> <td style="text-align:right;"> 1.3275139 </td> </tr> <tr> <td style="text-align:left;"> Idaho </td> <td style="text-align:right;"> 1.2443096 </td> </tr> <tr> <td style="text-align:left;"> Illinois </td> <td style="text-align:right;"> -0.0586612 </td> </tr> <tr> <td style="text-align:left;"> Indiana </td> <td style="text-align:right;"> 0.4059830 </td> </tr> <tr> <td style="text-align:left;"> Iowa </td> <td style="text-align:right;"> 2.1960892 </td> </tr> <tr> <td style="text-align:left;"> Kansas </td> <td style="text-align:right;"> 1.9256885 </td> </tr> <tr> <td style="text-align:left;"> Kentucky </td> <td style="text-align:right;"> -2.2652570 </td> </tr> <tr> <td style="text-align:left;"> Louisiana </td> <td style="text-align:right;"> -3.8826563 </td> </tr> <tr> <td style="text-align:left;"> Maine </td> <td style="text-align:right;"> 0.4547571 </td> </tr> <tr> <td style="text-align:left;"> Maryland </td> <td style="text-align:right;"> 0.2844478 </td> </tr> <tr> <td style="text-align:left;"> Massachusetts </td> <td style="text-align:right;"> 1.3868972 </td> </tr> <tr> <td style="text-align:left;"> Michigan </td> <td style="text-align:right;"> -0.1768465 </td> </tr> <tr> <td style="text-align:left;"> Minnesota </td> <td style="text-align:right;"> 2.2025281 </td> </tr> <tr> <td style="text-align:left;"> Mississippi </td> <td style="text-align:right;"> -4.0362219 </td> </tr> <tr> <td style="text-align:left;"> Missouri </td> <td style="text-align:right;"> -0.3652702 </td> </tr> <tr> <td style="text-align:left;"> Montana </td> <td style="text-align:right;"> 0.9359256 </td> </tr> <tr> <td style="text-align:left;"> Nebraska </td> <td style="text-align:right;"> 2.0060961 </td> </tr> <tr> <td style="text-align:left;"> Nevada </td> <td style="text-align:right;"> 0.4719808 </td> 
</tr> <tr> <td style="text-align:left;"> New Hampshire </td> <td style="text-align:right;"> 1.1727342 </td> </tr> <tr> <td style="text-align:left;"> New Jersey </td> <td style="text-align:right;"> 0.7618589 </td> </tr> <tr> <td style="text-align:left;"> New Mexico </td> <td style="text-align:right;"> -1.6465196 </td> </tr> <tr> <td style="text-align:left;"> New York </td> <td style="text-align:right;"> -0.4937635 </td> </tr> <tr> <td style="text-align:left;"> North Carolina </td> <td style="text-align:right;"> -2.7036034 </td> </tr> <tr> <td style="text-align:left;"> North Dakota </td> <td style="text-align:right;"> 1.9049237 </td> </tr> <tr> <td style="text-align:left;"> Ohio </td> <td style="text-align:right;"> 0.3444655 </td> </tr> <tr> <td style="text-align:left;"> Oklahoma </td> <td style="text-align:right;"> 0.0227251 </td> </tr> <tr> <td style="text-align:left;"> Oregon </td> <td style="text-align:right;"> 1.8066483 </td> </tr> <tr> <td style="text-align:left;"> Pennsylvania </td> <td style="text-align:right;"> -0.0242343 </td> </tr> <tr> <td style="text-align:left;"> Rhode Island </td> <td style="text-align:right;"> 0.5548203 </td> </tr> <tr> <td style="text-align:left;"> South Carolina </td> <td style="text-align:right;"> -3.7722712 </td> </tr> <tr> <td style="text-align:left;"> South Dakota </td> <td style="text-align:right;"> 1.5131049 </td> </tr> <tr> <td style="text-align:left;"> Tennessee </td> <td style="text-align:right;"> -2.1379510 </td> </tr> <tr> <td style="text-align:left;"> Texas </td> <td style="text-align:right;"> -1.8743614 </td> </tr> <tr> <td style="text-align:left;"> Utah </td> <td style="text-align:right;"> 2.0995090 </td> </tr> <tr> <td style="text-align:left;"> Vermont </td> <td style="text-align:right;"> 0.8805572 </td> </tr> <tr> <td style="text-align:left;"> Virginia </td> <td style="text-align:right;"> -0.8810536 </td> </tr> <tr> <td style="text-align:left;"> Washington </td> <td style="text-align:right;"> 1.9687535 </td> </tr> 
<tr> <td style="text-align:left;"> West Virginia </td> <td style="text-align:right;"> -1.7131805 </td> </tr> <tr> <td style="text-align:left;"> Wisconsin </td> <td style="text-align:right;"> 1.5728437 </td> </tr> <tr> <td style="text-align:left;"> Wyoming </td> <td style="text-align:right;"> 0.9429316 </td> </tr> </tbody> </table></div>

---

# First PC on Map

<img src="PCA_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" />

---

# Second Principal Component

- Sometimes a single index still oversimplifies the data.<!--D-->

--

- The second principal component is an LC that<!--D-->

--

  1. Is uncorrelated with the first PC.
  2. Has the highest variance out of all LCs that satisfy condition 1.<!--D-->

--

- Since there is no need for PC2 to *explain* any variance already explained by PC1, PC2 and PC1 are uncorrelated.<!--D-->

--

- We can plot the first two principal components on a scatter plot.

---

# Scatter-plot of PCs
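
A scatter plot of the first two PCs can be sketched in base R. This is only a sketch: it assumes that R's built-in `state.x77` matrix and `state.abb` vector hold the same data as *StateSE.rds* (the column names differ slightly), and it uses `prcomp` with standardised variables. The sign of each PC is arbitrary, so the picture may appear mirrored relative to other output.

```r
# Assumption: the built-in state.x77 matrix contains the same five variables
# as StateSE.rds (note the column names "Life Exp" and "HS Grad").
vars   <- c("Income", "Illiteracy", "Life Exp", "Murder", "HS Grad")
states <- state.x77[, vars]

# PCA on standardised variables; the principal components are stored in pca$x
pca <- prcomp(states, scale. = TRUE)

# Plot PC1 against PC2, labelling each point with the state abbreviation
plot(pca$x[, 1], pca$x[, 2], type = "n", xlab = "PC1", ylab = "PC2")
text(pca$x[, 1], pca$x[, 2], labels = state.abb)

# The two components are uncorrelated by construction
cor(pca$x[, 1], pca$x[, 2])
```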
---

# The weights

<table> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> PC1 </th> <th style="text-align:right;"> PC2 </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Income </td> <td style="text-align:right;"> 0.3473146 </td> <td style="text-align:right;"> 0.7315324 </td> </tr> <tr> <td style="text-align:left;"> Illiteracy </td> <td style="text-align:right;"> -0.4803318 </td> <td style="text-align:right;"> 0.0693093 </td> </tr> <tr> <td style="text-align:left;"> LifeExp </td> <td style="text-align:right;"> 0.4685523 </td> <td style="text-align:right;"> -0.3243911 </td> </tr> <tr> <td style="text-align:left;"> Murder </td> <td style="text-align:right;"> -0.4594049 </td> <td style="text-align:right;"> 0.4916219 </td> </tr> <tr> <td style="text-align:left;"> HSGrad </td> <td style="text-align:right;"> 0.4669687 </td> <td style="text-align:right;"> 0.3363552 </td> </tr> </tbody> </table>

- A large positive (negative) weight indicates a strong positive (negative) association between a variable and the corresponding PC.

---

# Biplot

- The weight vectors can be plotted on the same scatterplot as the data.<!--D-->

--

- This is called a biplot.<!--D-->

--

- We can do several useful things with a biplot<!--D-->

--

  + See how the observations relate to one another
  + See how the variables relate to one another
  + See how the observations relate to the variables

---

# Types of biplot

- There are multiple ways to draw a biplot.<!--D-->

--

- We will look at two versions<!--D-->

--

  + Distance Biplot
  + Correlation Biplot

---

# Distance Biplot

<img src="PCA_files/figure-html/dbiplot-1.png" style="display: block; margin: auto;" />

---

# Distance Biplot

- The distance between observations indicates how similar they are<!--D-->

--

  + Louisiana (LA) and South Carolina (SC) are close and therefore similar.
  + Arkansas (AR) and California (CA) are far apart and therefore different.<!--D-->

--

- If the variables are ignored this is identical to a scatter plot of principal components.

---

# Correlation Biplot

<img src="PCA_files/figure-html/cbiplot-1.png" style="display: block; margin: auto;" />

---

# Correlations

<table class="table" style="font-size: 14px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> Income </th> <th style="text-align:right;"> Illiteracy </th> <th style="text-align:right;"> LifeExp </th> <th style="text-align:right;"> Murder </th> <th style="text-align:right;"> HSGrad </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Income </td> <td style="text-align:right;"> 1.000 </td> <td style="text-align:right;"> -0.437 </td> <td style="text-align:right;"> 0.340 </td> <td style="text-align:right;"> -0.230 </td> <td style="text-align:right;"> 0.620 </td> </tr> <tr> <td style="text-align:left;"> Illiteracy </td> <td style="text-align:right;"> -0.437 </td> <td style="text-align:right;"> 1.000 </td> <td style="text-align:right;"> -0.588 </td> <td style="text-align:right;"> 0.703 </td> <td style="text-align:right;"> -0.657 </td> </tr> <tr> <td style="text-align:left;"> LifeExp </td> <td style="text-align:right;"> 0.340 </td> <td style="text-align:right;"> -0.588 </td> <td style="text-align:right;"> 1.000 </td> <td style="text-align:right;"> -0.781 </td> <td style="text-align:right;"> 0.582 </td> </tr> <tr> <td style="text-align:left;"> Murder </td> <td style="text-align:right;"> -0.230 </td> <td style="text-align:right;"> 0.703 </td> <td style="text-align:right;"> -0.781 </td> <td style="text-align:right;"> 1.000 </td> <td style="text-align:right;"> -0.488 </td> </tr> <tr> <td style="text-align:left;"> HSGrad </td> <td style="text-align:right;"> 0.620 </td> <td style="text-align:right;"> -0.657 </td> <td style="text-align:right;"> 0.582 </td> <td style="text-align:right;"> -0.488
</td> <td style="text-align:right;"> 1.000 </td> </tr> </tbody> </table>

---

# Correlation Biplot

- The angles between variables tell us something about correlation (approximately)
  + Income and HSGrad are highly positively correlated. The angle between them is close to zero.
  + LifeExp and Income are only weakly correlated. The angle between them is close to 90 degrees.
  + Murder and LifeExp are highly negatively correlated. The angle between them is close to 180 degrees.

---

# More comparison

- The biplot also allows us to compare observations to variables.<!--D-->

--

- Think of the variables as axes.<!--D-->

--

- Draw the shortest line from each point to the axis.<!--D-->

--

- The position along that axis gives an approximation to the actual value of the variable for that observation.

---

# Biplot

<img src="PCA_files/figure-html/dbiplot2-1.png" style="display: block; margin: auto;" />

---

# More PCs

- We can find a third PC, which has the highest variance among all LCs that are uncorrelated with PC1 and PC2.<!--D-->

--

- We cannot visualise this with a biplot, but there are alternatives depending on the structure of the data.<!--D-->

--

- We now turn to a time series example in which we consider 3 principal components.

---

# A Time Series Example

- The Stock and Watson dataset contains data on 109 macroeconomic variables in the following categories<!--D-->

--

  + Output
  + Prices
  + Labour
  + Finance<!--D-->

--

- One cannot look at 109 time series plots to visualise general macroeconomic conditions.<!--D-->

--

- However, one can look at time series plots of the principal components of these variables.

---

# Plots of PCs

<img src="PCA_files/figure-html/sw-1.png" style="display: block; margin: auto;" />

---

# All PCs

- There are as many principal components as there are variables.
- Together all `\(p\)` principal components explain all of the variation in all `\(p\)` original variables.
`$$\sum_{j=1}^p \mbox{Var}(C_j)=\sum_{j=1}^p \mbox{Var}(Y_j)$$`
- Where `\(C_j\)` is principal component `\(j\)` and `\(Y_j\)` is variable `\(j\)`

---

# So why PCs?

- However a small number of principal components can often explain a large proportion of the variance<!--D-->

--

  + In the first example, 2 PCs explain 83% of the total variation of 5 variables.<!--D-->

--

  + In our second example, 3 PCs explain 35% of the total variation of 109 variables.

---

# Summary

- Principal components analysis is useful for<!--D-->

--

  + Creating a single index<!--D-->

--

  + Seeing how variables are associated with observations on a single biplot.<!--D-->

--

  + Visualising high-dimensional time series.<!--D-->

--

- How do we do it?

---
class: inverse, middle, center

# Implementation of PCA

---

# Restriction

- Recall that the objective is to find an LC with a large variance. How could we ‘cheat’?<!--D-->

--

  + For a single variable `\(\mbox{Var}(wY) = w^2 \mbox{Var}(Y)\)` <!--D-->

--

  + The variance can be made large by choosing a huge value of `\(w\)`.<!--D-->

--

- For this reason the following restriction (normalisation) is used
`$$w_1^2 + w_2^2 + \ldots + w_p^2 = 1.$$`

---

# Standardisation

- A similar logic applies to the units that the variables are measured in.<!--D-->

--

- In the states dataset, income varies from about $3000 to $6000, life expectancy varies from about 68 years to 74 years.<!--D-->

--

  + Which variable will probably have the larger variance?<!--D-->

--

- Income is likely to have a larger variance.

---

# Different units

- If income is measured in $’000s then it will vary from about 3 to 6

--

- If Life Expectancy is measured in days rather than years it will vary from about 24800 days to 26900 days<!--D-->

--

  + Which variable will have the larger variance now?<!--D-->

--

- The weights can be influenced by the units of measurement.
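
The restriction and the effect of units can be checked numerically. The sketch below is only illustrative: it assumes that R's built-in `state.x77` matrix contains the same five variables as *StateSE.rds*, and the object names (`states`, `pca_raw`, `pca_units`) are made up. It runs `prcomp` without standardising, once in the original units and once with income in $'000s and life expectancy in days (taking 365 days per year).

```r
# Assumption: built-in state.x77 stands in for the StateSE data
vars   <- c("Income", "Illiteracy", "Life Exp", "Murder", "HS Grad")
states <- state.x77[, vars]

# PCA without standardising, variables in their original units
pca_raw <- prcomp(states)

# Normalisation: the squared weights of each PC sum to one
colSums(pca_raw$rotation^2)

# All PCs together carry the same total variance as the original variables
sum(pca_raw$sdev^2)
sum(apply(states, 2, var))

# Change the units: income in $'000s, life expectancy in days
states_units <- states
states_units[, "Income"]   <- states[, "Income"] / 1000
states_units[, "Life Exp"] <- states[, "Life Exp"] * 365
pca_units <- prcomp(states_units)

# Which variable has the largest absolute weight on PC1 in each case?
top_raw   <- names(which.max(abs(pca_raw$rotation[, 1])))
top_units <- names(which.max(abs(pca_units$rotation[, 1])))
c(original_units = top_raw, changed_units = top_units)
```

In the original units the largest PC1 weight should fall on income, and after the change of units on life expectancy, because in each case that variable has by far the largest variance.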

---

# Effect of standardisation

<table>
<thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> Std </th> <th style="text-align:right;"> Unstd </th> <th style="text-align:right;"> DifUnits </th> </tr> </thead>
<tbody>
<tr> <td style="text-align:left;"> Income </td> <td style="text-align:right;"> 0.3473 </td> <td style="text-align:right;"> 1.0000 </td> <td style="text-align:right;"> 0.0004 </td> </tr>
<tr> <td style="text-align:left;"> Illiteracy </td> <td style="text-align:right;"> -0.4803 </td> <td style="text-align:right;"> -0.0004 </td> <td style="text-align:right;"> -0.0007 </td> </tr>
<tr> <td style="text-align:left;"> LifeExp </td> <td style="text-align:right;"> 0.4686 </td> <td style="text-align:right;"> 0.0007 </td> <td style="text-align:right;"> 0.9999 </td> </tr>
<tr> <td style="text-align:left;"> Murder </td> <td style="text-align:right;"> -0.4594 </td> <td style="text-align:right;"> -0.0014 </td> <td style="text-align:right;"> -0.0059 </td> </tr>
<tr> <td style="text-align:left;"> HSGrad </td> <td style="text-align:right;"> 0.4670 </td> <td style="text-align:right;"> 0.0081 </td> <td style="text-align:right;"> 0.0096 </td> </tr>
</tbody>
</table>

---

# Standardise or not?

- While the normalisation `\(w_1^2 + w_2^2+\ldots+ w_p^2 = 1\)` is always implemented in any software that does PCA, the decision to standardise is up to you.<!--D-->
--

- If the variables are measured in the *same* units then
  + *No* need to standardise.<!--D-->
--

- If the variables are measured in *different* units then
  + *Standardise* the data.

---

# Principal Components in R

- There are several functions for doing Principal Components Analysis in R.
We will use `prcomp`<!--D-->
--

- We can scale in two ways<!--D-->
--

  + Scale the data using the function `scale`
  + Include the option `scale.=TRUE` when calling the function `prcomp`<!--D-->
--

- Now we will do PCA on the states dataset using R

---

# Principal Components in R

```r
library(dplyr) #Provides %>%, select_if and pull
StateSE<-readRDS("StateSE.rds") #Assumes the file is in the working directory
StateSE%>%
  select_if(is.numeric)%>% #Only use numeric variables
  prcomp(scale. = TRUE)->pca #Do pca
summary(pca) #summary of information
```

```
## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5
## Standard deviation     1.7892 0.9686 0.6317 0.55561 0.39093
## Proportion of Variance 0.6403 0.1876 0.0798 0.06174 0.03057
## Cumulative Proportion  0.6403 0.8279 0.9077 0.96943 1.00000
```

---

# Principal Components in R

- The output of the `prcomp` function is a `prcomp` object. <!--D-->
--

- It is a list that contains a lot of information. Of most interest are<!--D-->
--

  + The principal components, which are stored in `x`
  + The weights, which are stored in `rotation`

---

# Biplot

- The biplot can be produced by:

```r
biplot(pca)
```

- To have the state abbreviations on the plot they need to be attached to the matrix `pca$x`

```r
rownames(pca$x)<-pull(StateSE,StateAbb)
biplot(pca)
```

- Try it!

---

# Correlation biplot

- By default `biplot` produces the distance biplot.
- To produce the correlation biplot try

```r
biplot(pca,scale = 0)
```

---

# Scree Plot

- Another plot that is easy to create is the Scree plot.<!--D-->
--

- Along the horizontal axis is the Principal Component.<!--D-->
--

- Along the vertical axis is the variance corresponding to each Principal Component.<!--D-->
--

- The Scree plot indicates how much of the total variance of the data each PC explains.<!--D-->
--

```r
screeplot(pca,type="lines")
```

---

# Scree Plot

<img src="PCA_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />

---

# Selecting the number of PCs

- The Scree plot can be used to select the number of Principal Components.<!--D-->
--

- Look for the point where the plot flattens out, also called the elbow of the Scree plot.<!--D-->
--

- Another criterion used for standardised data is Kaiser’s Rule. The rule is to select all PCs with a variance greater than 1.

---

# Number of PCs

- The way PCs are selected depends on the nature of the analysis.<!--D-->
--

- For a visualisation via the biplot, two PCs must be selected.<!--D-->
--

- In this case check the proportion of variance explained by those PCs.<!--D-->
--

- The higher this number, the more accurate the biplot.

---

# PCA and MDS

- When the input distances to MDS are Euclidean, MDS and PCA are equivalent.
--

- The usual caveat applies that these may only be exactly identical if the MDS solution is rotated.
--

- The same does not apply generally to PCA. The first PC is defined to maximise variance.

---

# Interpreting PCs

- Remember that Principal Components do nothing more than find uncorrelated linear combinations of the variables that explain variance.
--

- Sometimes the nature of the data or analysis from a biplot might imply some sort of interpretation for the PCs.
--

- These interpretations can be subjective, so be cautious.
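
---

# Selecting PCs in R

- A minimal sketch of the selection checks from the earlier slides: the variance of each PC, Kaiser's rule and the cumulative proportion of variance. It assumes R's built-in `state.x77` matrix as a stand-in for *StateSE.rds*; the names `pca_states`, `pc_var` and `n_kaiser` are illustrative only.

```r
# Assumption: state.x77 (shipped with base R) stands in for StateSE.rds
vars <- c("Income", "Illiteracy", "Life Exp", "Murder", "HS Grad")
pca_states <- prcomp(state.x77[, vars], scale. = TRUE)

# The variance of each PC is its squared standard deviation
pc_var <- pca_states$sdev^2
names(pc_var) <- colnames(pca_states$x)

# With standardised data the variances sum to the number of variables
total_var <- sum(pc_var)

# Kaiser's rule: count the PCs with variance greater than 1
n_kaiser <- sum(pc_var > 1)

# Proportion of variance explained, individually and cumulatively
prop_var <- pc_var / total_var
cum_prop <- cumsum(prop_var)

print(round(rbind(Variance = pc_var,
                  Proportion = prop_var,
                  Cumulative = cum_prop), 4))
print(n_kaiser)
```

- For a biplot, the value to check is `cum_prop[2]`, the proportion of variance explained by the first two PCs.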

---

# Towards Factor Analysis

- For survey data it is often the case that multiple survey questions are measures of the same underlying factor.<!--D-->
--

- For example, at the end of semester you evaluate this unit. <!--D-->
--

- Typically you will be asked many questions.<!--D-->
--

- This is no different from any other customer satisfaction survey.

---

# Underlying factors

- Although you are asked many questions, perhaps there are two underlying factors that drive your responses<!--D-->
--

  + The quality of the course materials
  + The quality of the teaching staff<!--D-->
--

- Perhaps the quality of assessment is a third factor.<!--D-->
--

- For survey data, Scree plots and Kaiser's rule can be used to select the number of underlying factors.

---

# To do

- These issues will be investigated in the topic on *Factor Modelling*, which has some similarities (but also some important distinctions) when compared to PCA.
--

- Later on we will also look more deeply into PCA, proving some important results.
--

- For now the primary objective is to understand what PCA does and how to implement it in R.