class: center, middle, inverse, title-slide # Detecting anomalies in smart meter data via manifold learning ### Anastasios Panagiotelis ### University of Sydney --- # Smart Meter .pull-left[ - Smart meters measure electricity usage in a building. - Need to detect anomalous electricity usage. - Scalable to millions of buildings. ] .pull-right[ <img src="img/smartmeter.jpeg" width="265" /> ] --- # Smart Meter data - Three housholds in an Irish dataset <img src="img/fig3.png" width="658" /> --- # Smart Meter data - Interested in distribution <img src="img/fig3plus.png" width="658" /> --- # Visualisation <img src="img/all_hh_one.png" width="768" /> --- # Interpretation - Each dot represents the probability distribution of electricity usage for a single household. -- - The colors depend on the value of a kernel density estimate of the points. -- - The 'typical' household comes from the middle of the distribution. -- - The 'anomalous' households comes from edges of the distribution. -- - But how do we get that plot? --- #Outline - Manifold Learning -- + Multidimensional Scaling -- + Isomap algorithm -- - Statistical manifolds -- + Hellinger distance estimator -- + Why it is fast -- - Application -- + One household -- + All households --- class: center, middle, inverse # Dimension reduction --- # Idea - Data are multivariate, i.e. `\(\mathbf{x}_i\in\mathbb{R}^p\)` where `\(p\)` is large and `\(i=1,2,\dots,n\)`. -- + Consider `\(n\)` firms with `\(p\)` financial indicators -- - Construct new data `\(\mathbf{y}_i\in\mathbb{R}^d\)` where `\(d<<p\)`. -- - How might we do this? --- # Distances - Close observations in input space should be close in output space. -- - *Euclidean distance* is a measure of similarity. -- - Inputs: `\(\delta_{ij}=\sqrt{(\mathbf{x}_i-\mathbf{x}_j)'(\mathbf{x}_i-\mathbf{x}_j)}\)` -- - Outputs: `\(d_{ij}=\sqrt{(\mathbf{y}_i-\mathbf{y}_j)'(\mathbf{y}_i-\mathbf{y}_j)}\)` -- - Construct `\(\mathbf{y}\)` so that `\(d_{ij}\)` approximate `\(\delta_{ij}\)` --- # Criterion Minimise: `$$\sum_{i,j}(\delta^2_{ij}-d^2_{ij})$$` Solved by -- 1. Putting `\(\delta_{ij}\)` in a matrix `\(\boldsymbol{\Delta}\)` 2. Double centering `\(\boldsymbol{\Delta}\)` 3. Finding eigenvectors. --- # Non-Euclidean - What if input distances `\(\delta_{ij}\)` are not Euclidean? -- - Can put them in `\(\mathbf{\Delta}\)` anyway -- - Optimise a slightly different objective, but still get a good representation. -- - Output distances `\(d_{ij}\)` still approximate input distances `\(\delta_{ij}\)` --- # Manifold learning - What if data lie on some non-linear `\(d\)`-dimensional surface in `\(p\)`-dimensional space. -- - Think of the surface of a sphere (2-dimensional) in space (3-dimensional). -- - Euclidean distance is no longer appropriate -- - Should instead use the geodesic distance. -- - This is the aim of ISOMAP. --- # Toy example Consider the following data <img src="RiskWorkshop_files/figure-html/unnamed-chunk-5-1.png" height="450" style="display: block; margin: auto;" /> --- # Manifold Lying on a manifold. <img src="RiskWorkshop_files/figure-html/unnamed-chunk-6-1.png" height="450" style="display: block; margin: auto;" /> --- # Euclidean distance Blue points close in ambient space. <img src="RiskWorkshop_files/figure-html/unnamed-chunk-7-1.png" height="450" style="display: block; margin: auto;" /> --- # Geodesic Distance But not on manifold. <img src="RiskWorkshop_files/figure-html/unnamed-chunk-8-1.png" height="450" style="display: block; margin: auto;" /> --- # Input distances - Classical MDS would use the distance in blue as an input -- - The idea behind Isomap is to use the distance in green as an input. -- - However to compute the geodesic (green) distance we need to know the manifold. -- - Instead we approximate the geodesic distance --- # Geodesic Distance Geodesic can be approximated. <img src="RiskWorkshop_files/figure-html/unnamed-chunk-9-1.png" height="450" style="display: block; margin: auto;" /> --- # Geodesic Distance Try zooming.
--- # Approximate Geodesic - Find the graph of nearest neighbors -- + Neighbourhood within a `\(\epsilon\)`-ball + `\(K\)`-Nearest neighbours -- - As edge weights use the distance between neighbors -- - Approximate geodesic distance by finding shortest path --- # Example - ISOMAP is just one manifold learning algorithm. -- - Others include: + LLE + Laplacian Eigenmaps + t-SNE + ... -- - These algorithms (and many others) use the nearest neighbor graph. --- class: center, middle, inverse # Statistical Manifolds --- # Statistical manifolds - A statistical manifold is a manifold with elements that are **probability distributions** -- - Can no longer think of the inputs as being `\(p\)`-dimensional vectors -- - Can no longer think of Euclidean distance as being a distance between input points. -- - We can use other measures of distance. --- # Hellinger Distance - Distance between two distributions given by `$$H_{i,j}^2=\frac{1}{2}\int\left(\sqrt{p_i(z)}-\sqrt{p_j(z)}\right)^2dz$$` -- - Where `\(p_i(x)\)` and `\(p_j(x)\)` are densities corresponding to observation `\(i\)` and `\(j\)` respectively. -- - Need to estimate using data `\(z_{i,1},z_{i,2},\dots,z_{i,T_i}\sim p_i\)` and `\(z_{j,1},z_{j,2},\dots,z_{j,T_j}\sim p_j\)`. --- # Our estimator - Pool all the data `\(z_{i,1},z_{i,2},\dots z_{i,T_i}\forall i=1,2,\dots,n\)` together. - Find `\(L\)` equal probability partitions of the pooled data `\(I_l=(\gamma_{l-1},\gamma_{l-1}]\)` and let `$${\pi}_{i,l}=\sqrt{\frac{1}{2T_i}\sum_t I\left\{z_{i,t}\in I_l\right\}}$$` -- - Estimator is `$$\hat{H}_{i,j}=\sqrt{\sum_l\left(\pi_{i,l}-\pi_{j,l}\right)^2}$$` --- # Estimator - We prove this estimator is consistent as the number of observations `\(T_i\)` and number of partitions `\(L\)` goes to infinity. -- - This is not the only estimator of Hellinger distance. -- - What makes it special is that it 'looks like' a Euclidean distance. -- - This is important for **computational** reasons. --- # Nearest neighbour graph - Recall that manifold learning algorithms require finding a nearest neighbor graph. -- - A naive way to do this would be to compute all `\(N^2\)` pairwise distances. -- - A smarter way is to use an algorithm such as k-d trees. -- - This only requires computing `\(O(Nlog(N))\)` pairwise distances. -- - But it only works for a few distance metrics (one of which is Euclidean). --- # Approximate nearest neighbours - Further speed up is possible via **approximate** nearest neighbors. -- - May find an 'incorrect' nearest neighbor, but this is still guaranteed to be within `\(1+\epsilon\)` of the true nearest neighbor. -- - We test for robustness of results to using ANN. -- - We also use a more recent algorithm known as ANNOY --- class: middle, center, inverse # Application --- # Single household - Consider a single household. - Each 'observation' is the distribution corresponding to an hour of week, over many weeks. - Here `\(N=336\)` and `\(T\approx535\)` - Use ISOMAP with exact nearest neighbors, ANN and ANNOY --- # Results <img src="img/fig6_tod.png" width="1600" height="450" style="display: block; margin: auto;" /> --- # Results <img src="img/fig6.png" width="4000" height="450" style="display: block; margin: auto;" /> --- # Typical and anomalous hours <img src="img/fig7.png" width="6667" height="450" style="display: block; margin: auto;" /> --- # Speed <table class="table lightable-paper" style='margin-left: auto; margin-right: auto; font-family: "Arial Narrow", arial, helvetica, sans-serif; margin-left: auto; margin-right: auto;'> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> Isomap </th> <th style="text-align:right;"> LLE </th> <th style="text-align:right;"> Laplacian Eigenmaps </th> <th style="text-align:right;"> Hessian LLE </th> <th style="text-align:right;"> t-SNE </th> <th style="text-align:right;"> UMAP </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;width: 5cm; "> Exact NN with brute-force </td> <td style="text-align:right;font-weight: bold;"> 2.760 </td> <td style="text-align:right;font-weight: bold;"> 5.134 </td> <td style="text-align:right;font-weight: bold;width: 2cm; "> 2.944 </td> <td style="text-align:right;font-weight: bold;"> 2.087 </td> <td style="text-align:right;font-weight: bold;"> 1.918 </td> <td style="text-align:right;font-weight: bold;"> 2.367 </td> </tr> <tr> <td style="text-align:left;width: 5cm; "> Exact NN with k-d trees </td> <td style="text-align:right;"> 2.360 </td> <td style="text-align:right;"> 2.609 </td> <td style="text-align:right;width: 2cm; "> 2.295 </td> <td style="text-align:right;"> 1.669 </td> <td style="text-align:right;"> 1.596 </td> <td style="text-align:right;"> 2.089 </td> </tr> <tr> <td style="text-align:left;width: 5cm; "> ANN k-d trees </td> <td style="text-align:right;"> 2.351 </td> <td style="text-align:right;"> 2.986 </td> <td style="text-align:right;width: 2cm; "> 2.302 </td> <td style="text-align:right;"> 1.663 </td> <td style="text-align:right;"> 1.601 </td> <td style="text-align:right;"> 2.048 </td> </tr> <tr> <td style="text-align:left;width: 5cm; "> ANN Annoy </td> <td style="text-align:right;"> 1.942 </td> <td style="text-align:right;"> 3.093 </td> <td style="text-align:right;width: 2cm; "> 2.378 </td> <td style="text-align:right;"> 1.695 </td> <td style="text-align:right;"> 1.647 </td> <td style="text-align:right;"> 2.125 </td> </tr> </tbody> </table> --- # Accuracy <table class="table lightable-paper" style='margin-left: auto; margin-right: auto; font-family: "Arial Narrow", arial, helvetica, sans-serif; margin-left: auto; margin-right: auto;'> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> Isomap </th> <th style="text-align:right;"> LLE </th> <th style="text-align:right;"> Laplacian Eigenmaps </th> <th style="text-align:right;"> Hessian LLE </th> <th style="text-align:right;"> t-SNE </th> <th style="text-align:right;"> UMAP </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Exact NN </td> <td style="text-align:right;font-weight: bold;"> 0.943 </td> <td style="text-align:right;"> 0.709 </td> <td style="text-align:right;"> 0.885 </td> <td style="text-align:right;"> 0.882 </td> <td style="text-align:right;"> 0.935 </td> <td style="text-align:right;"> 0.924 </td> </tr> <tr> <td style="text-align:left;"> ANN k-d trees </td> <td style="text-align:right;"> 0.942 </td> <td style="text-align:right;"> 0.718 </td> <td style="text-align:right;font-weight: bold;"> 0.891 </td> <td style="text-align:right;font-weight: bold;"> 0.885 </td> <td style="text-align:right;"> 0.902 </td> <td style="text-align:right;font-weight: bold;"> 0.933 </td> </tr> <tr> <td style="text-align:left;"> ANN Annoy </td> <td style="text-align:right;"> 0.938 </td> <td style="text-align:right;font-weight: bold;"> 0.746 </td> <td style="text-align:right;"> 0.886 </td> <td style="text-align:right;"> 0.873 </td> <td style="text-align:right;font-weight: bold;"> 0.939 </td> <td style="text-align:right;"> 0.929 </td> </tr> </tbody> </table> --- # All households - Consider all household. - Each 'observation' is the distribution corresponding to an household. - Here `\(N=3639\)` and `\(T\approx 179760\)` - Use ISOMAP with exact nearest neighbors, ANN and ANNOY --- # Results <img src="img/all_hh.png" width="960" height="450" style="display: block; margin: auto;" /> --- # Speed <table class="table lightable-paper" style='margin-left: auto; margin-right: auto; font-family: "Arial Narrow", arial, helvetica, sans-serif; width: auto !important; margin-left: auto; margin-right: auto;'> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> Isomap </th> <th style="text-align:right;"> LLE </th> <th style="text-align:right;"> Laplacian Eigenmaps </th> <th style="text-align:right;"> Hessian LLE </th> <th style="text-align:right;"> t-SNE </th> <th style="text-align:right;"> UMAP </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;width: 5cm; "> Exact NN with brute-force </td> <td style="text-align:right;font-weight: bold;"> 4773.4 </td> <td style="text-align:right;font-weight: bold;"> 4967.4 </td> <td style="text-align:right;font-weight: bold;width: 2cm; "> 4725.2 </td> <td style="text-align:right;font-weight: bold;"> 5930.4 </td> <td style="text-align:right;font-weight: bold;"> 4739.6 </td> <td style="text-align:right;"> 748.6 </td> </tr> <tr> <td style="text-align:left;width: 5cm; "> Exact NN with k-d trees </td> <td style="text-align:right;"> 755.8 </td> <td style="text-align:right;"> 1038.2 </td> <td style="text-align:right;width: 2cm; "> 749.6 </td> <td style="text-align:right;"> 1204.1 </td> <td style="text-align:right;"> 747.7 </td> <td style="text-align:right;"> 744.3 </td> </tr> <tr> <td style="text-align:left;width: 5cm; "> ANN k-d trees </td> <td style="text-align:right;"> 755.7 </td> <td style="text-align:right;"> 1041.0 </td> <td style="text-align:right;width: 2cm; "> 751.5 </td> <td style="text-align:right;"> 1208.7 </td> <td style="text-align:right;"> 747.8 </td> <td style="text-align:right;"> 741.7 </td> </tr> <tr> <td style="text-align:left;width: 5cm; "> ANN Annoy </td> <td style="text-align:right;"> 1670.2 </td> <td style="text-align:right;"> 1947.3 </td> <td style="text-align:right;width: 2cm; "> 1656.1 </td> <td style="text-align:right;"> 2363.7 </td> <td style="text-align:right;"> 1557.1 </td> <td style="text-align:right;font-weight: bold;"> 1675.3 </td> </tr> </tbody> </table> --- # Accuracy <table class="table lightable-paper" style='margin-left: auto; margin-right: auto; font-family: "Arial Narrow", arial, helvetica, sans-serif; margin-left: auto; margin-right: auto;'> <thead> <tr> <th style="text-align:right;"> Method </th> <th style="text-align:right;"> Isomap </th> <th style="text-align:right;"> LLE </th> <th style="text-align:right;"> Laplacian Eigenmaps </th> <th style="text-align:right;"> Hessian LLE </th> <th style="text-align:right;"> t-SNE </th> <th style="text-align:right;"> UMAP </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> Exact NN </td> <td style="text-align:right;"> 0.926 </td> <td style="text-align:right;"> 0.911 </td> <td style="text-align:right;"> 0.880 </td> <td style="text-align:right;font-weight: bold;"> 0.579 </td> <td style="text-align:right;"> 0.914 </td> <td style="text-align:right;"> 0.943 </td> </tr> <tr> <td style="text-align:right;"> ANN k-d trees </td> <td style="text-align:right;"> 0.926 </td> <td style="text-align:right;"> 0.911 </td> <td style="text-align:right;"> 0.880 </td> <td style="text-align:right;font-weight: bold;"> 0.579 </td> <td style="text-align:right;font-weight: bold;"> 0.919 </td> <td style="text-align:right;"> 0.943 </td> </tr> <tr> <td style="text-align:right;"> ANN Annoy </td> <td style="text-align:right;font-weight: bold;"> 0.926 </td> <td style="text-align:right;font-weight: bold;"> 0.911 </td> <td style="text-align:right;font-weight: bold;"> 0.880 </td> <td style="text-align:right;"> 0.578 </td> <td style="text-align:right;"> 0.907 </td> <td style="text-align:right;font-weight: bold;"> 0.943 </td> </tr> </tbody> </table> --- # Conclusions - We have a way of visualising and finding anomalies for *statistical manifolds* -- - Relies on a consistent estimator of Hellinger distance -- - Has computational advantages that exploit Nearest Neighbor algorithms -- - Further computational improvements possible with ANN and there is little tradeoff on accuracy. --- class: middle, center, inverse # Questions?