class: center, middle, inverse, title-slide

# Updating Variational Bayes
## with Application to Predicting Car Position
### Anastasios Panagiotelis
### February 2019

---
class: inverse, center, middle

# Context

---

# Real time decisions

- Motivating examples:

--

  + High-frequency finance.
  + Smart electricity grids.
  + Self-driving cars.

--

- Data is observed frequently.

--

- Decisions are based on a statistical model.

--

- Decisions must be made quickly and repeatedly.

---

# Observe first block of data
- Observe data from `\(t=1\)` to `\(T_1\)`, i.e., `\(\boldsymbol{y}_{1:T_1}\)`.

---

# Make first decision
- This decision should be based on `\(p(\boldsymbol{\theta}|\boldsymbol{y}_{1:T_1})\)`.

---

# Observe second block of data
- Observe data from `\(t=T_1+1\)` to `\(T_2\)`, i.e., `\(\boldsymbol{y}_{T_1+1:T_2}\)`.

---

# Make second decision

- Decision 2 should be based on `\(p(\boldsymbol{\theta}|\boldsymbol{y}_{1:T_2})\)`.
- Use `\(p(\boldsymbol{\theta}|\boldsymbol{y}_{1:T_1})\)` to avoid reprocessing `\(\boldsymbol{y}_{1:T_1}\)` (a toy sketch follows on the next slide).
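---

# Sketch: reusing the first posterior

A minimal illustration of the updating idea, assuming a toy conjugate model (Gaussian data with known variance) rather than any model from this talk: processing the second block with the block-one posterior as the prior recovers the full-data posterior without touching `\(\boldsymbol{y}_{1:T_1}\)` again.

```python
import numpy as np

def normal_update(mu, tau2, y_block, sigma2):
    """Posterior N(mean, var) for theta after y_block, starting from prior N(mu, tau2)."""
    prec = 1.0 / tau2 + len(y_block) / sigma2              # posterior precision
    mean = (mu / tau2 + y_block.sum() / sigma2) / prec
    return mean, 1.0 / prec

rng = np.random.default_rng(0)
sigma2 = 1.0
y1 = rng.normal(0.5, 1.0, size=100)                        # first block  y_{1:T1}
y2 = rng.normal(0.5, 1.0, size=50)                         # second block y_{T1+1:T2}

m1, v1 = normal_update(0.0, 10.0, y1, sigma2)              # posterior | block 1
m2, v2 = normal_update(m1, v1, y2, sigma2)                 # update using block 2 only
mb, vb = normal_update(0.0, 10.0, np.concatenate([y1, y2]), sigma2)  # batch over all data
print(np.allclose([m2, v2], [mb, vb]))                     # True: the posteriors agree
```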
---

# Bayesian updating

- The Bayesian framework does this naturally.

--

`$$\class{emphcolb}{p(\boldsymbol{\theta}|\boldsymbol{y}_{1:T_1})}\propto p(\boldsymbol{y}_{1:T_1}|\boldsymbol{\theta})p(\boldsymbol{\theta})$$`

--

- After observing `\(\class{emphcol}{\boldsymbol{y}_{T_1+1:T_2}}\)`

--

`$$p(\boldsymbol{\theta}|\boldsymbol{y}_{1:T_2})\propto p(\class{emphcol}{\boldsymbol{y}_{T_1+1:T_2}}|\boldsymbol{\theta})\class{emphcolb}{p(\boldsymbol{\theta}|\boldsymbol{y}_{1:T_1})}$$`

--

- Is this practical?

---
class: inverse, middle, center

# Patience is NOT a virtue!

---

# Bayesian Inference

- For all but the simplest models, Bayesian inference is computationally intensive:

--

  + MCMC
  + HMC
  + RHMC
  + SMC
  + PMCMC
  + PMMH

--

- The alphabet soup goes on...

---

# Variational Bayes

- A faster alternative is **Variational Bayes (VB)**.

--

- VB emerged from *machine learning*.

--

- Approximate the posterior `\(p({\boldsymbol{\theta}}|{\boldsymbol y})\)` by `\(q_{\boldsymbol{\lambda}}({\boldsymbol{\theta}})\)`.

--

- The vector `\({\boldsymbol{\lambda}}\)` contains auxiliary parameters of the approximation.

--

- For a multivariate Gaussian `\(q_{\boldsymbol{\lambda}}({\boldsymbol{\theta}})\)`, `\(\boldsymbol{\lambda}\)` contains means, variances and covariances.

--

- We consider finite mixtures of multivariate Gaussians.

<!-- --- -->
<!-- # In detail -->
<!-- - The objective is to minimise the KL divergence between `\(p\)` and `\(q\)` -->
<!-- `$$\underset{\boldsymbol{\lambda}}{\min}\int\limits_{\boldsymbol\theta}\ln\left(\frac{q_{\boldsymbol{\lambda}}({\boldsymbol{\theta}})}{p({\boldsymbol{\theta}}|{\boldsymbol y})}\right)q_{\boldsymbol{\lambda}}({\boldsymbol{\theta}})d\boldsymbol{\theta}$$` -->
<!-- -- -->
<!-- - Bayesian inference recast as an optimisation problem. -->

---

# The ELBO

- Objective function: maximise the ELBO.

`$$\mathcal{L}=\int\limits_{\boldsymbol\theta}\ln\left(\frac{p({\boldsymbol{\theta}},{\boldsymbol y})}{q_{\boldsymbol{\lambda}}({\boldsymbol{\theta}})}\right)q_{\boldsymbol{\lambda}}({\boldsymbol{\theta}})d\boldsymbol{\theta}$$`

--

- Equivalent to minimising the KL divergence between `\(q\)` and the posterior.

--

- The term `\(p(\boldsymbol{\theta},\boldsymbol{y})\)` is the product of the prior and the likelihood.

--

- Bayesian inference via optimisation.

---

<!-- # How to do it? -->
<!-- - For simple models -->
<!-- -- -->
<!-- + like the exponential class, -->
<!-- -- -->
<!-- - and simple approximations -->
<!-- -- -->
<!-- + that assume independence, -->
<!-- -- -->
<!-- - we can use coordinate ascent. -->
<!-- -- -->
<!-- What if the approximation `\(q_{\boldsymbol{\lambda}}(\boldsymbol{\theta})\)` is something flexible such as a mixture of multivariate normals? -->
<!-- --- -->

# Stochastic Gradient Ascent

- Maximise the ELBO using **Stochastic Gradient Ascent (SGA)**. The same technique is used to train deep neural nets.

--

`$${\boldsymbol\lambda}^{(m+1)} = {\boldsymbol\lambda}^{(m)} + \boldsymbol{\rho}^{(m)}\cdot \widehat{\frac{\partial\mathcal{L}}{\partial {\boldsymbol\lambda}}} \bigg\rvert_{{\boldsymbol\lambda} = {\boldsymbol\lambda}^{(m)}}$$`

--

- The learning rate `\(\boldsymbol{\rho}^{(m)}\)` is determined by the Adam method (Kingma and Ba, 2014).

--

- Abbreviate `\(\widehat{\frac{\partial\mathcal{L}}{\partial {\boldsymbol\lambda}}} \bigg\rvert_{{\boldsymbol\lambda} = {\boldsymbol\lambda}^{(m)}}\)` as `\(\nabla_{\boldsymbol\lambda^{(m)}}\)` hereafter.

---

# An important distinction

- For deep learning, the gradient can often, in principle, be evaluated exactly.

--

- However, these applications can involve an enormous number of training observations.

--

- It is then only feasible to approximate the gradient using minibatches of the data.

--

- In contrast, for VB the gradient is stochastic even when all data is used.

--

- The gradient involves an integral that must be evaluated numerically.

---

# Gradient

The gradient of the ELBO is given by

`$$\nabla_{{\boldsymbol\lambda}^{(m)}}=\int{\class{emphcol}{\Psi(\boldsymbol{\theta})}}{q_{{\boldsymbol\lambda}^{(m)}}(\boldsymbol{\theta})d{\boldsymbol{\theta}}}$$`

--

where

`$${\class{emphcol}{\Psi(\boldsymbol{\theta})}}=\frac{\partial\log q_{\boldsymbol\lambda}(\boldsymbol{\theta})}{\partial{\boldsymbol{\lambda}}}\bigg\rvert_{{\boldsymbol\lambda} = {\boldsymbol\lambda}^{(m)}}\left(\log p(\boldsymbol{\theta},\boldsymbol{y})-\log q_{\boldsymbol\lambda^{(m)}}(\boldsymbol{\theta})\right)$$`

---

# Monte Carlo approximation

The gradient

`$$\nabla_{{\boldsymbol\lambda}^{(m)}}={\class{emphcolb}\int}\Psi(\boldsymbol{\theta}){\class{emphcolb}{q_{\boldsymbol\lambda^{(m)}}(\boldsymbol{\theta})d{\boldsymbol{\theta}}}}$$`

--

can be approximated by drawing `\(\boldsymbol{\theta}^{[1]},\ldots,\boldsymbol{\theta}^{[S]}\sim q_{\boldsymbol{\lambda}^{(m)}}\)` and computing

`$$\nabla_{{\boldsymbol\lambda}^{(m)}}\approx{\class{emphcolb}{\frac{1}{S}\sum\limits_{s=1}^S}}\Psi(\boldsymbol{\theta}^{[s]})$$`
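---

# Sketch: a noisy gradient estimate

A minimal sketch of the estimator above, under assumptions that are mine rather than the talk's: a diagonal-Gaussian `\(q_{\boldsymbol{\lambda}}\)` with `\(\boldsymbol{\lambda}=(\boldsymbol{\mu},\log\boldsymbol{\sigma})\)` and a generic `log_joint` function returning `\(\log p(\boldsymbol{\theta},\boldsymbol{y})\)`. The talk's `\(q\)` is a mixture of multivariate Gaussians and the step sizes come from Adam.

```python
import numpy as np

def elbo_gradient(lam, log_joint, S=100, rng=np.random.default_rng()):
    """(1/S) * sum_s Psi(theta^[s]) for a diagonal-Gaussian q with lam = (mu, log_sd)."""
    mu, log_sd = lam
    sd = np.exp(log_sd)
    theta = mu + sd * rng.standard_normal((S, mu.size))          # theta^[s] ~ q_lambda
    log_q = -0.5 * np.sum(((theta - mu) / sd) ** 2 + 2 * log_sd + np.log(2 * np.pi), axis=1)
    diff = np.array([log_joint(t) for t in theta]) - log_q       # log p(theta, y) - log q(theta)
    score_mu = (theta - mu) / sd ** 2                            # d log q / d mu
    score_log_sd = ((theta - mu) / sd) ** 2 - 1.0                # d log q / d log_sd
    return (np.mean(score_mu * diff[:, None], axis=0),
            np.mean(score_log_sd * diff[:, None], axis=0))

# Toy target: log p(theta, y) = log N(theta; 1, I), so the gradient pushes mu towards 1.
log_joint = lambda th: -0.5 * np.sum((th - 1.0) ** 2 + np.log(2 * np.pi))
lam = (np.zeros(2), np.zeros(2))
g_mu, g_log_sd = elbo_gradient(lam, log_joint)   # plug into lambda <- lambda + rho * gradient
```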
---
class: center, middle, inverse

# Updating Variational Bayes

---

# Bayesian updating

- Recall Bayesian updating:

--

`$$\class{emphcolb}{p(\boldsymbol{\theta}|\boldsymbol{y}_{1:T_1})}\propto p(\boldsymbol{y}_{1:T_1}|\boldsymbol{\theta})p(\boldsymbol{\theta})$$`

--

- After observing `\(\class{emphcol}{\boldsymbol{y}_{T_1+1:T_2}}\)`

--

`$$p(\boldsymbol{\theta}|\boldsymbol{y}_{1:T_2})\propto p(\class{emphcol}{\boldsymbol{y}_{T_1+1:T_2}}|\boldsymbol{\theta})\class{emphcolb}{p(\boldsymbol{\theta}|\boldsymbol{y}_{1:T_1})}$$`

--

- How to adapt this for Variational Inference?

---

# Updating Variational Bayes

- Replace `\(p\)`'s with `\(q\)`'s.

--

`$$\class{emphcolb}{q_{\boldsymbol{\lambda}_1}}(\boldsymbol{\theta}){\underset{\sim}{\propto}} p(\boldsymbol{y}_{1:T_1}|\boldsymbol{\theta})p(\boldsymbol{\theta})$$`

--

- After observing `\(\class{emphcol}{\boldsymbol{y}_{T_1+1:T_2}}\)`

--

`$${q_{\boldsymbol{\lambda}_2}}(\boldsymbol{\theta}){\underset{\sim}{\propto}} p(\class{emphcol}{\boldsymbol{y}_{T_1+1:T_2}}|\boldsymbol{\theta})\class{emphcolb}{q_{\boldsymbol{\lambda}_1}}(\boldsymbol{\theta})$$`

--

- This is Updating Variational Bayes (UVB).

--

- For UVB the likelihood involves `\(T_{i+1}-T_{i}\)` terms.
- For standard VB (SVB) it involves `\(T_{i+1}\)` terms.

---

# Why is it faster?

- UVB can start before SVB has even finished collecting the data (see the sketch on the next slide).
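---

# Sketch: the UVB target for a new block

Continuing the toy diagonal-Gaussian sketch from earlier (my own illustration, not the talk's code): at block `\(i\)` the previous variational fit `\(q_{\boldsymbol{\lambda}^*_{i-1}}\)` plays the role of the prior, so the log-joint passed to the gradient estimator only touches the new observations. `log_lik_block` is a placeholder for the model's block log-likelihood.

```python
import numpy as np

def log_q_diag(theta, lam):
    """Log density of a diagonal-Gaussian q with lam = (mu, log_sd)."""
    mu, log_sd = lam
    sd = np.exp(log_sd)
    return -0.5 * np.sum(((theta - mu) / sd) ** 2 + 2 * log_sd + np.log(2 * np.pi))

def uvb_log_joint(theta, new_block, log_lik_block, lam_prev):
    # log p(y_{T_{i-1}+1:T_i} | theta) + log q_{lambda*_{i-1}}(theta):
    # the earlier blocks enter only through the previous variational fit.
    return log_lik_block(theta, new_block) + log_q_diag(theta, lam_prev)

# Run the same SGA loop as before on this target, initialising lambda at lambda*_{i-1}.
```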
---

<!-- # Standard VB -->
<!-- - Wait to collect all data, `\(y_{1:T_2}\)`. -->
<!-- -- -->
<!-- - Choose initial `\(\boldsymbol{\lambda}^{(0)}\)` and draw `\(\boldsymbol{\theta}\sim q_{\boldsymbol{\lambda}^{(0)}}\)`. -->
<!-- -- -->
<!-- - Approximate gradient (likelihood in `\(\Psi\)` involves `\(T_2\)` terms). -->
<!-- -- -->
<!-- - Update to `\(\boldsymbol{\lambda}^{(1)}\)` and draw `\(\boldsymbol{\theta}\sim q_{\boldsymbol{\lambda}^{(1)}}\)`. -->
<!-- -- -->
<!-- - Repeat until convergence gives `\(\boldsymbol{\lambda}^*\)`. -->
<!-- -- -->
<!-- - The approximation to `\(p(\boldsymbol{\theta}|\boldsymbol{y}_{1:T_2})\)` is `\(q_{\boldsymbol{\lambda}^*}\)`. -->
<!-- --- -->
<!-- # Updating Variational Bayes -->
<!-- - Collect first block of data `\(\boldsymbol{y}_{1:T_1}\)` -->
<!-- -- -->
<!-- - Approximate `\(p(\boldsymbol{\theta}|\boldsymbol{y}_{1:T_1})\)` by `\(q_{\boldsymbol{\lambda}^*_1}\)` using VB. -->
<!-- -- -->
<!-- - Collect second block of data `\(\boldsymbol{y}_{T_1+1:T_2}\)` -->
<!-- -- -->
<!-- - Use `\(\boldsymbol{\lambda}^*_1\)` as an initial value for `\(\boldsymbol{\lambda}^{(0)}_2\)`. -->
<!-- -- -->
<!-- - Approximate gradient (likelihood in `\(\Psi\)` involves `\(T_2-T_1\)` terms) -->
<!-- -- -->
<!-- - Update to `\(\boldsymbol{\lambda}_2^{(1)}\)` and draw `\(\boldsymbol{\theta}\sim q_{\boldsymbol{\lambda}_2^{(1)}}\)`. -->
<!-- -- -->
<!-- - Repeat until convergence gives `\(\boldsymbol{\lambda}_2^*\)`. -->
<!-- -- -->
<!-- - The approximation to `\(p(\boldsymbol{\theta}|\boldsymbol{y}_{1:T_2})\)` is `\(q_{\boldsymbol{\lambda}_2^*}\)` -->
<!-- --- -->

# Updating Variational Bayes

--

.center[![](img/UVB.gif)]

---

# Standard Variational Bayes

--

.center[![](img/VB.gif)]

---

# Markov chain Monte Carlo

--

.center[![](img/MCMC.gif)]

---

# An even faster way

- At time `\(T_i\)` and step `\(m\)` of SGA:

--

- Draw values of `\(\boldsymbol{\theta}\)` from `\(q_{\boldsymbol{\lambda}^{(m)}_{i}}\)`.

--

- Compute `\(\Psi(.)\)` at each of these draws to update to `\(\boldsymbol{\lambda}^{(m+1)}_{i}\)`.

--

- We propose an algorithm that does not repeat these steps as `\(m\)` is incremented.

--

- Use `\(q_{\boldsymbol{\lambda}^*_{i-1}}\)` as an importance density for the entire SGA run at time `\(T_i\)`.

---

# Importance Sampling

Monte Carlo: Draw `\(\boldsymbol{\theta}^{[s]}\overset{iid}{\sim} q_{\boldsymbol{\lambda}^{(m)}_i}\)` and

--

`$$\nabla_{\boldsymbol{\lambda}_i^{(m)}}\approx\frac{1}{S}\sum\limits_{s=1}^S \Psi(\boldsymbol{\theta}^{[s]})$$`

--

Importance Sampling: Draw `\(\boldsymbol{\theta}^{[s]}\overset{iid}{\sim} q_{\boldsymbol{\lambda}^*_{i-1}}\)`

--

`$$\nabla_{\boldsymbol{\lambda}_i^{(m)}}\approx\frac{1}{S}\sum\limits_{s=1}^S \class{emphcolb}{\frac{q_{\boldsymbol{\lambda}^{(m)}_{i}}(\boldsymbol{\theta}^{[s]})}{q_{\boldsymbol{\lambda}^{*}_{i-1}}(\boldsymbol{\theta}^{[s]})}}\Psi(\boldsymbol{\theta}^{[s]})$$`

---

# Importance Sampling

Monte Carlo: Draw `\(\boldsymbol{\theta}^{[s]}\overset{iid}{\sim} \class{emphcol}{q_{\boldsymbol{\lambda}^{(m)}_i}}\)` and

`$$\nabla_{\boldsymbol{\lambda}_i^{(m)}}\approx\frac{1}{S}\sum\limits_{s=1}^S \Psi(\boldsymbol{\theta}^{[s]})$$`

Importance Sampling: Draw `\(\boldsymbol{\theta}^{[s]}\overset{iid}{\sim} \class{emphcol}{q_{\boldsymbol{\lambda}^*_{i-1}}}\)`

`$$\nabla_{\boldsymbol{\lambda}_i^{(m)}}\approx\frac{1}{S}\sum\limits_{s=1}^S \frac{q_{\boldsymbol{\lambda}^{(m)}_i}(\boldsymbol{\theta}^{[s]})}{q_{\boldsymbol{\lambda}^{*}_{i-1}}(\boldsymbol{\theta}^{[s]})}\Psi(\boldsymbol{\theta}^{[s]})$$`

<!-- --- -->
<!-- # Why does this work? -->
<!-- - For UVB, each time `\(\boldsymbol{\lambda_2}\)` is updated a new sample of `\(\boldsymbol{\theta}\)` must be generated. -->
<!-- -- -->
<!-- - Also `\(\Psi\)` (which involves a likelihood calculation) must be recomputed after each update of `\(\boldsymbol{\lambda_2}\)`. -->
<!-- -- -->
<!-- - For UVB with Importance sampling (UVB-IS) a sample of `\(\boldsymbol{\theta}\)` is drawn and `\(\Psi\)` computed only ONCE per block of data. -->
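---

# Sketch: reusing draws across the SGA run

A sketch of the reuse idea in the same toy diagonal-Gaussian setting as the earlier slides (my illustration, not the talk's implementation): `\(\boldsymbol{\theta}\)` is drawn and the expensive log-joint term evaluated once per block, and only the importance weights and the score are recomputed as `\(\boldsymbol{\lambda}\)` changes.

```python
import numpy as np

def uvbis_gradient(lam, lam_prev, theta, log_joint_vals):
    """Importance-weighted gradient estimate; theta was drawn once from q_{lambda*_{i-1}}."""
    mu, log_sd = lam
    sd = np.exp(log_sd)
    log_q = -0.5 * np.sum(((theta - mu) / sd) ** 2 + 2 * log_sd + np.log(2 * np.pi), axis=1)
    mu0, log_sd0 = lam_prev
    sd0 = np.exp(log_sd0)
    log_q_prev = -0.5 * np.sum(((theta - mu0) / sd0) ** 2 + 2 * log_sd0 + np.log(2 * np.pi), axis=1)
    w = np.exp(log_q - log_q_prev)                      # q_lambda / q_{lambda*_{i-1}}
    diff = log_joint_vals - log_q                       # log p(theta, y) - log q(theta)
    score_mu = (theta - mu) / sd ** 2
    score_log_sd = ((theta - mu) / sd) ** 2 - 1.0
    return (np.mean(w[:, None] * score_mu * diff[:, None], axis=0),
            np.mean(w[:, None] * score_log_sd * diff[:, None], axis=0))

# Per block: draw theta ~ q_{lambda*_{i-1}} and evaluate log_joint_vals once,
# then call uvbis_gradient repeatedly inside the SGA loop as lambda is updated.
```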
---

# UVB-IS

<img src="img/UVBIS.png" width="405" style="display: block; margin: auto;" />

---

# Compared to Sakaya and Klami

<img src="img/sayaka.png" width="340" style="display: block; margin: auto;" />

---

# Updating Variational Bayes

--

.center[![](img/UVB.gif)]

---

# Updating Variational Bayes (IS)

--

.center[![](img/UVBIS.gif)]

---

# Speed at the cost of accuracy

--

.center[![](img/Crash.gif)]

---
class: center, middle, inverse

# Evaluation

---

# Simulation 1

- Stationary AR(3) model.

--

- One-step-ahead forecasts need to be made at `\(t=100,125,150,\ldots,500\)`.

--

- Forecast accuracy is measured by the log score.

--

- Conduct 500 replications.

--

- For `\(q\)`, consider a mixture of multivariate normals with `\(K=1,2,3\)` components.

---

# Simulation 2

$$ y_{it}|k_i=j\sim N(\mu_j,\sigma^2_j) $$

- The number of time periods and the number of cross-sectional units are both 100: `\(t=1,\ldots,100\)`, `\(i=1,\ldots,100\)`.

--

- Each unit comes from one of two distributions, `\(k_i\in\left\{1,2\right\}\)` for all `\(i\)`.

--

- The aim is to classify each unit into the correct component at `\(t=10,20,30,\ldots,100\)`.

--

- Accuracy is measured by the proportion of correct classifications.

---

# Simulation 2: Approximation

- Continuous parameters `\(\boldsymbol{\theta}\)` and discrete parameters `\(k_{i}\)`.

--

- Marginalise out the discrete parameters (component indicators)

`$$p(\boldsymbol{\theta},\boldsymbol{y})=p(\boldsymbol{\theta})\prod\limits^n_{i=1}\class{emphcolb}{\sum\limits_{j=1,2}}\text{Pr}(k_i=j)\prod\limits_{t=1}^T\phi(y_{it};\mu_j,\sigma^2_j)$$`

--

- Approximate by `\(q(\boldsymbol{\theta})\)`, and use it to obtain estimates of `\(\mbox{Pr}(k_i=j|\boldsymbol{y})\)` (see the sketch on the next slide).
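---

# Sketch: classifying a unit

A sketch of the classification step above (my own illustration, not the talk's code): given draws `\(\boldsymbol{\theta}^{[s]}\sim q\)`, `\(\mbox{Pr}(k_i=j|\boldsymbol{y})\)` is estimated by averaging the conditional component probabilities over the draws. The dictionary keys `mu`, `sigma` (a standard deviation) and `pi` are assumptions of the sketch.

```python
import numpy as np
from scipy.stats import norm

def classify_unit(y_i, theta_draws):
    """Estimate Pr(k_i = j | y) for one unit from draws theta^[s] ~ q(theta)."""
    probs = np.zeros(2)
    for th in theta_draws:
        # log [ Pr(k_i = j) * prod_t phi(y_it; mu_j, sigma_j^2) ] for j = 1, 2
        log_num = np.array([np.log(th["pi"][j])
                            + norm.logpdf(y_i, th["mu"][j], th["sigma"][j]).sum()
                            for j in range(2)])
        log_num -= log_num.max()                          # stabilise before exponentiating
        probs += np.exp(log_num) / np.exp(log_num).sum()  # Pr(k_i = j | y_i, theta^[s])
    return probs / len(theta_draws)
```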
---

# Relative cumulative run time

- For SVB, the time taken to conduct inference at time `\(T_i\)`.

--

- For UVB, the sum of the times taken to conduct inference at `\(T_1,T_2,\ldots,T_i\)`.

--

- This accounting is, if anything, unfair to our methods, since SVB is only ever charged for its most recent run.

--

- All times are reported relative to SVB at time `\(T_1\)` (all methods share this as a first step).

---

# Time comparison (at `\(T_2\)`)

---

# Time comparison (at `\(T_3\)`)
---

# Simulation 1: Timing

<img src="img/AR3timing.png" width="384" style="display: block; margin: auto;" />

---

# Simulation 1: Accuracy

<img src="img/AR3cls.png" width="384" style="display: block; margin: auto;" />

---

# Simulation 2: Timing

<img src="img/mixNormTiming.png" width="384" style="display: block; margin: auto;" />

---

# Simulation 2: Accuracy

<img src="img/mixNormScore.png" width="384" style="display: block; margin: auto;" />

---
class: inverse, middle, center

# Predicting Car Position

---

# Car Position

- Predict lateral deviation from the centre of the lane, every 100 milliseconds.

--

- Train the model on `\(500\)` vehicles and predict for another `\(500\)` vehicles.

--

- Assume a model similar to Simulation 2.

--

- Dirichlet process prior on the parameters of each component.

--

- The number of components is data driven.

---

# Data preprocessing

- Geometry of the road is removed by a spline model trained on a different set of cars.

<img src="img/cars_roadStraightening.png" width="512" style="display: block; margin: auto;" />

---

# Clusters

<img src="img/DPPgroups_2.png" width="640" style="display: block; margin: auto;" /><img src="img/DPPoverall_2.png" width="640" style="display: block; margin: auto;" />

---

# Forecast accuracy

- The benchmark model assumes all cars are independent, so it cannot use the information learned about clusters.

--

- Then consider the DPP model with three different inferential methods:

--

  + MFVB (assumes independence in `\(q\)`)
  + SVB with a 3-component normal mixture
  + UVB-IS with a 3-component normal mixture

---

# Results

<img src="img/cumulative log scores.png" width="640" style="display: block; margin: auto;" />

---

# Conclusions

- The independent model performs poorly. We need to *borrow strength* from other cars.

--

- The mean field approximation also performs poorly. We need dependence in the posterior.

--

- The results for SVB and UVB-IS sit almost on top of one another.

---

# Conclusions

- Evidence suggests that updating can be used with VB.

--

- The gains in computational time can be substantial.

--

- The trade-off in terms of accuracy is generally small.

--

- The exact nature of this trade-off will depend on the model and the purposes of the analysis.