class: center, middle, inverse, title-slide

.title[
# Dimension Reduction: t-SNE
]
.author[
### Anastasios Panagiotelis
]
.institute[
### University of Sydney
]

---

# Outline

- Stochastic Neighbour Embedding (SNE)

--

- Why t-SNE?

--

  - High dimensional space
  - Outliers

--

- Applications

--

- Reference is Van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. *Journal of Machine Learning Research*, **9**(11), 2579–2605.

---

class: center, middle, inverse

# SNE

---

# Jumping

- Consider that you are currently at point `\(i\)`

--

- You would like to 'jump' to another point.

--

- While the jump is random, you should jump to nearby points with higher probability.

---

# A Distribution

- Let the probability of a jump from point `\(i\)` to point `\(j\)` be given by

--

`$$p_{j|i}\propto \exp\left(-||\pmb{x}_i-\pmb{x}_j||^2/2\sigma^2_i\right)$$`

--

- Since there are finitely many points, these probabilities can be normalised to sum to one.

---

# Determining `\(\sigma_i\)`

Find `\(\sigma_i\)` for each `\(i\)` so that the perplexity

`$$2^{-\sum_{j\neq i}p_{j|i}\log_2p_{j|i}}$$`

is fixed at a chosen value (usually between 5 and 50).

--

This allows `\(\sigma_i\)` to vary with the density of points around `\(\pmb{x}_i\)`.

---

# Output space

- Consider a low dimensional representation `\(\pmb{y}_i\)` of point `\(i\)`

--

- A jumping distribution can be defined in a similar way

--

`$$q_{j|i}\propto \exp\left(-||\pmb{y}_i-\pmb{y}_j||^2\right)$$`

--

Note there is no `\(\sigma_i\)` here.

---

# A Good Representation

- The representation is accurate if the `\(q_{j|i}\)` are similar to the `\(p_{j|i}\)`
- Find the `\(\pmb{y}_i\)` that minimise the KL divergence between the input and output 'jumping' distributions

--

`$$\min_{\pmb{y}_1,\dots,\pmb{y}_n} \sum_i\sum_{j\neq i}p_{j|i}\ln\left(\frac{p_{j|i}}{q_{j|i}}\right)$$`

---

class: middle, inverse, center

# Problems with SNE

---

# Outliers

- For an outlier `\(\pmb{x}_i\)`, the probabilities `\(p_{i|j}\)` of jumping to it will be small for all `\(j\)`.

--

- This means that outliers make little contribution to the cost function

--

- The position of `\(\pmb{y}_i\)` is therefore not well determined if `\(\pmb{x}_i\)` is an outlier.

--

- Numerical instability can occur when optimising the KL divergence (usually done by gradient descent)

---

# Solution

Consider a symmetric version of the problem where

`$$p_{ij}=\frac{p_{i|j}+p_{j|i}}{2n}$$`

and then optimise a single KL divergence over pairs of points

`$$\min\sum\limits_{i,j}p_{ij}\ln\left(\frac{p_{ij}}{q_{ij}}\right)$$`

This solves the problem with outliers: since `\(\sum_j p_{ij}>\frac{1}{2n}\)` for every `\(i\)`, each point now makes a non-negligible contribution to the cost.

---

# How big is 1D space?

<img src="10tSNE_files/figure-html/unnamed-chunk-1-1.png" height="450" style="display: block; margin: auto;" />

The black region is 4 times as large as the orange region

---

# How big is 2D space?

<img src="10tSNE_files/figure-html/unnamed-chunk-2-1.png" height="450" style="display: block; margin: auto;" />

The black area is 1.778 times as large as the orange area

---

# How big is 3D space?
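The same comparison can be checked numerically rather than from a plot. A minimal sketch, assuming (as the 1D and 2D ratios above imply) that the black region is a ball of radius 0.8 inside a unit ball, with the orange shell making up the remainder:

```r
# The volume of a d-dimensional ball scales as radius^d, so the ratio
# of black (inner ball, radius 0.8) to orange (outer shell) volume is:
r <- 0.8
round(r^(1:3) / (1 - r^(1:3)), 3)
#> [1] 4.000 1.778 1.049
```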
The orange and black volumes are roughly equal

---

# Consequences

- In all of the above plots, the black and orange points lie within a fixed radius of the center.
- As the dimension grows there is relatively more **volume** in the orange (outer) region.
- When reducing dimension, points spread over a large volume must be mapped into a smaller volume.
- This causes points to collapse together (the 'crowding problem').

---

# Fixing this issue

- Moderately distant input points need to be mapped to more distant output points
- One way to achieve this is to use a different function for the jump probabilities in the output space only

--

`$$q_{ij}\propto\left(1+||\pmb{y}_i-\pmb{y}_j||^2\right)^{-1}$$`

--

This is the kernel of the t-distribution with one degree of freedom, hence the name t-SNE.

---

# Explanation

- Moderately distant input points need to be mapped to output points that are relatively far apart.
- Because the t kernel has heavier tails, the `\(q_{ij}\)` for such points can stay close to the corresponding `\(p_{ij}\)` even at large output distances.
- This avoids points collapsing together.

---

class: middle, inverse, center

# Implementation

---

# Swiss Roll

```r
library(dimRed)  # provides loadDataSet(), embed() and the plot methods
datsr <- loadDataSet("Swiss Roll")
plot(datsr, type = '3varsrgl')
```
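---

# Sketch: calibrating `\(\sigma_i\)`

Before embedding, each `\(\sigma_i\)` is found by searching for the value that hits the target perplexity, as described earlier. The code below is a minimal sketch of that idea, not how dimRed/Rtsne actually implement it; the helper names `p_given_i` and `calibrate_sigma` are made up for this illustration.

```r
# Conditional jump probabilities p_{j|i} for a given sigma_i
p_given_i <- function(X, i, sigma) {
  d2 <- colSums((t(X) - X[i, ])^2)  # squared distances from point i
  p <- exp(-d2 / (2 * sigma^2))
  p[i] <- 0                         # a point never jumps to itself
  p / sum(p)                        # normalise to sum to one
}

# Perplexity 2^H of a discrete distribution
perplexity <- function(p) 2^(-sum(p[p > 0] * log2(p[p > 0])))

# Binary search for the sigma_i matching a target perplexity;
# perplexity increases monotonically in sigma
calibrate_sigma <- function(X, i, target = 30, lo = 1e-10, hi = 1e10) {
  for (iter in 1:50) {
    mid <- sqrt(lo * hi)
    if (perplexity(p_given_i(X, i, mid)) > target) hi <- mid else lo <- mid
  }
  mid
}

set.seed(1)
X <- matrix(rnorm(200 * 3), ncol = 3)  # toy data standing in for the Swiss Roll
calibrate_sigma(X, i = 1)
```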
---

# tSNE

```r
tsneout <- embed(datsr, "tSNE")
plot(tsneout, type = "2vars")
```

<img src="10tSNE_files/figure-html/unnamed-chunk-6-1.png" height="450" style="display: block; margin: auto;" />

---

# Perplexity = 10

```r
tsneout <- embed(datsr, "tSNE", perplexity = 10)
plot(tsneout, type = "2vars")
```

<img src="10tSNE_files/figure-html/unnamed-chunk-7-1.png" height="450" style="display: block; margin: auto;" />

---

# Perplexity = 100

```r
tsneout <- embed(datsr, "tSNE", perplexity = 100)
plot(tsneout, type = "2vars")
```

<img src="10tSNE_files/figure-html/unnamed-chunk-8-1.png" height="450" style="display: block; margin: auto;" />

---

# Pros and Cons

- Advantages
  + Fast
  + Robust

--

- Disadvantages
  + Creates clusters even where they do not exist.
  + Generally introduces distortion.
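---

# Clusters from noise

The first disadvantage can be seen directly by embedding data with no structure at all, using the same tools as above. A rough sketch (it assumes the `dimRedData` constructor wraps a plain matrix, and the exact picture varies with the seed):

```r
set.seed(1)
noise <- dimRedData(matrix(rnorm(500 * 10), ncol = 10))  # pure noise, no clusters
noiseout <- embed(noise, "tSNE", perplexity = 30)
plot(noiseout, type = "2vars")  # typically still shows apparent groupings
```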