class: center, middle, inverse, title-slide

# Updating Variational Bayes
## with Application to Predicting Car Position
### Anastasios Panagiotelis
### February 2019

---
class: inverse, center, middle

# Context

---

# Real time decisions

- Motivating examples:

--

  + High-frequency finance.
  + Smart electricity grids.
  + Self-driving cars.

--

- Data is observed frequently.

--

- Decisions are based on a statistical model.

--

- Decisions must be made quickly and repeatedly.

---

# Observe first block of data
- Observe data from `\(t=1\)` to `\(T_1\)`, i.e., `\(\boldsymbol{y}_{1:T_1}\)`.

---

# Make first decision
- This decision should be based on `\(p(\boldsymbol{\theta}|\boldsymbol{y}_{1:T_1})\)`.

---

# Observe second block of data
- Observe data from `\(t=T_1+1\)` to `\(T_2\)`, i.e., `\(\boldsymbol{y}_{T_1+1:T_2}\)`.

---

# Make second decision

- Decision 2 should be based on `\(p(\boldsymbol{\theta}|\boldsymbol{y}_{1:T_2})\)`.
- Use `\(p(\boldsymbol{\theta}|\boldsymbol{y}_{1:T_1})\)` to avoid reprocessing `\(\boldsymbol{y}_{1:T_1}\)` (a toy sketch follows on the next slide).
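---

# Sketch: reusing the first posterior

A minimal illustration of the updating idea, assuming a toy conjugate model (Gaussian data with known variance) rather than any model from this talk: processing the second block with the block-one posterior as the prior recovers the full-data posterior without touching `\(\boldsymbol{y}_{1:T_1}\)` again.

```python
import numpy as np

def normal_update(mu, tau2, y_block, sigma2):
    """Posterior N(mean, var) for theta after y_block, starting from prior N(mu, tau2)."""
    prec = 1.0 / tau2 + len(y_block) / sigma2              # posterior precision
    mean = (mu / tau2 + y_block.sum() / sigma2) / prec
    return mean, 1.0 / prec

rng = np.random.default_rng(0)
sigma2 = 1.0
y1 = rng.normal(0.5, 1.0, size=100)                        # first block  y_{1:T1}
y2 = rng.normal(0.5, 1.0, size=50)                         # second block y_{T1+1:T2}

m1, v1 = normal_update(0.0, 10.0, y1, sigma2)              # posterior | block 1
m2, v2 = normal_update(m1, v1, y2, sigma2)                 # update using block 2 only
mb, vb = normal_update(0.0, 10.0, np.concatenate([y1, y2]), sigma2)  # batch over all data
print(np.allclose([m2, v2], [mb, vb]))                     # True: the posteriors agree
```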
---

# Bayesian updating

- The Bayesian framework does this naturally.

--

`$$\class{emphcolb}{p(\boldsymbol{\theta}|\boldsymbol{y}_{1:T_1})}\propto p(\boldsymbol{y}_{1:T_1}|\boldsymbol{\theta})p(\boldsymbol{\theta})$$`

--

- After observing `\(\class{emphcol}{\boldsymbol{y}_{T_1+1:T_2}}\)`

--

`$$p(\boldsymbol{\theta}|\boldsymbol{y}_{1:T_2})\propto p(\class{emphcol}{\boldsymbol{y}_{T_1+1:T_2}}|\boldsymbol{\theta})\class{emphcolb}{p(\boldsymbol{\theta}|\boldsymbol{y}_{1:T_1})}$$`

--

- Is this practical?

---
class: inverse, middle, center

# Patience is NOT a virtue!

---

# Bayesian Inference

- For all but the simplest models, Bayesian inference is computationally intensive:

--

  + MCMC
  + HMC
  + RHMC
  + SMC
  + PMCMC
  + PMMH

--

- The alphabet soup goes on...

---

# Variational Bayes

- A faster alternative is **Variational Bayes (VB)**.

--

- VB emerged from *machine learning*.

--

- Approximate the posterior `\(p({\boldsymbol{\theta}}|{\boldsymbol y})\)` by `\(q_{\boldsymbol{\lambda}}({\boldsymbol{\theta}})\)`.

--

- The vector `\({\boldsymbol{\lambda}}\)` contains auxiliary parameters of the approximation.

--

- For a multivariate Gaussian `\(q_{\boldsymbol{\lambda}}({\boldsymbol{\theta}})\)`, `\(\boldsymbol{\lambda}\)` contains means, variances and covariances.

--

- We consider finite mixtures of multivariate Gaussians.

<!-- --- -->
<!-- # In detail -->
<!-- - The objective is to minimise the KL divergence between `\(p\)` and `\(q\)` -->
<!-- `$$\underset{\boldsymbol{\lambda}}{\min}\int\limits_{\boldsymbol\theta}\ln\left(\frac{q_{\boldsymbol{\lambda}}({\boldsymbol{\theta}})}{p({\boldsymbol{\theta}}|{\boldsymbol y})}\right)q_{\boldsymbol{\lambda}}({\boldsymbol{\theta}})d\boldsymbol{\theta}$$` -->
<!-- -- -->
<!-- - Bayesian inference recast as an optimisation problem. -->

---

# The ELBO

- Objective function: maximise the ELBO.

`$$\mathcal{L}=\int\limits_{\boldsymbol\theta}\ln\left(\frac{p({\boldsymbol{\theta}},{\boldsymbol y})}{q_{\boldsymbol{\lambda}}({\boldsymbol{\theta}})}\right)q_{\boldsymbol{\lambda}}({\boldsymbol{\theta}})d\boldsymbol{\theta}$$`

--

- Equivalent to minimising the KL divergence between `\(q\)` and the posterior.

--

- The term `\(p(\boldsymbol{\theta},\boldsymbol{y})\)` is the product of the prior and the likelihood.

--

- Bayesian inference via optimisation.

---

<!-- # How to do it? -->
<!-- - For simple models -->
<!-- -- -->
<!-- + like the exponential class, -->
<!-- -- -->
<!-- - and simple approximations -->
<!-- -- -->
<!-- + that assume independence, -->
<!-- -- -->
<!-- - we can use coordinate ascent. -->
<!-- -- -->
<!-- What if the approximation `\(q_{\boldsymbol{\lambda}}(\boldsymbol{\theta})\)` is something flexible such as a mixture of multivariate normals? -->
<!-- --- -->

# Stochastic Gradient Ascent

- Maximise the ELBO using **Stochastic Gradient Ascent (SGA)**. The same technique is used to train deep neural nets.

--

`$${\boldsymbol\lambda}^{(m+1)} = {\boldsymbol\lambda}^{(m)} + \boldsymbol{\rho}^{(m)}\cdot \widehat{\frac{\partial\mathcal{L}}{\partial {\boldsymbol\lambda}}} \bigg\rvert_{{\boldsymbol\lambda} = {\boldsymbol\lambda}^{(m)}}$$`

--

- The learning rate `\(\boldsymbol{\rho}^{(m)}\)` is determined by the Adam method (Kingma and Ba, 2014).

--

- Abbreviate `\(\widehat{\frac{\partial\mathcal{L}}{\partial {\boldsymbol\lambda}}} \bigg\rvert_{{\boldsymbol\lambda} = {\boldsymbol\lambda}^{(m)}}\)` as `\(\nabla_{\boldsymbol\lambda^{(m)}}\)` hereafter.

---

# An important distinction

- For deep learning, the gradient can often, in principle, be evaluated exactly.

--

- However, these applications can involve an enormous number of training observations.

--

- It is then only feasible to approximate the gradient using minibatches of the data.

--

- In contrast, for VB the gradient is stochastic even when all data is used.

--

- The gradient involves an integral that must be evaluated numerically.

---

# Gradient

The gradient of the ELBO is given by

`$$\nabla_{{\boldsymbol\lambda}^{(m)}}=\int{\class{emphcol}{\Psi(\boldsymbol{\theta})}}{q_{{\boldsymbol\lambda}^{(m)}}(\boldsymbol{\theta})d{\boldsymbol{\theta}}}$$`

--

where

`$${\class{emphcol}{\Psi(\boldsymbol{\theta})}}=\frac{\partial\log q_{\boldsymbol\lambda}(\boldsymbol{\theta})}{\partial{\boldsymbol{\lambda}}}\bigg\rvert_{{\boldsymbol\lambda} = {\boldsymbol\lambda}^{(m)}}\left(\log p(\boldsymbol{\theta},\boldsymbol{y})-\log q_{\boldsymbol\lambda^{(m)}}(\boldsymbol{\theta})\right)$$`

---

# Monte Carlo approximation

The gradient

`$$\nabla_{{\boldsymbol\lambda}^{(m)}}={\class{emphcolb}\int}\Psi(\boldsymbol{\theta}){\class{emphcolb}{q_{\boldsymbol\lambda^{(m)}}(\boldsymbol{\theta})d{\boldsymbol{\theta}}}}$$`

--

can be approximated by drawing `\(\boldsymbol{\theta}^{[1]},\ldots,\boldsymbol{\theta}^{[S]}\sim q_{\boldsymbol{\lambda}^{(m)}}\)` and computing

`$$\nabla_{{\boldsymbol\lambda}^{(m)}}\approx{\class{emphcolb}{\frac{1}{S}\sum\limits_{s=1}^S}}\Psi(\boldsymbol{\theta}^{[s]})$$`
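---

# Sketch: a noisy gradient estimate

A minimal sketch of the estimator above, under assumptions that are mine rather than the talk's: a diagonal-Gaussian `\(q_{\boldsymbol{\lambda}}\)` with `\(\boldsymbol{\lambda}=(\boldsymbol{\mu},\log\boldsymbol{\sigma})\)` and a generic `log_joint` function returning `\(\log p(\boldsymbol{\theta},\boldsymbol{y})\)`. The talk's `\(q\)` is a mixture of multivariate Gaussians and the step sizes come from Adam.

```python
import numpy as np

def elbo_gradient(lam, log_joint, S=100, rng=np.random.default_rng()):
    """(1/S) * sum_s Psi(theta^[s]) for a diagonal-Gaussian q with lam = (mu, log_sd)."""
    mu, log_sd = lam
    sd = np.exp(log_sd)
    theta = mu + sd * rng.standard_normal((S, mu.size))          # theta^[s] ~ q_lambda
    log_q = -0.5 * np.sum(((theta - mu) / sd) ** 2 + 2 * log_sd + np.log(2 * np.pi), axis=1)
    diff = np.array([log_joint(t) for t in theta]) - log_q       # log p(theta, y) - log q(theta)
    score_mu = (theta - mu) / sd ** 2                            # d log q / d mu
    score_log_sd = ((theta - mu) / sd) ** 2 - 1.0                # d log q / d log_sd
    return (np.mean(score_mu * diff[:, None], axis=0),
            np.mean(score_log_sd * diff[:, None], axis=0))

# Toy target: log p(theta, y) = log N(theta; 1, I), so the gradient pushes mu towards 1.
log_joint = lambda th: -0.5 * np.sum((th - 1.0) ** 2 + np.log(2 * np.pi))
lam = (np.zeros(2), np.zeros(2))
g_mu, g_log_sd = elbo_gradient(lam, log_joint)   # plug into lambda <- lambda + rho * gradient
```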
---
class: center, middle, inverse

# Updating Variational Bayes

---

# Bayesian updating

- Recall Bayesian updating:

--

`$$\class{emphcolb}{p(\boldsymbol{\theta}|\boldsymbol{y}_{1:T_1})}\propto p(\boldsymbol{y}_{1:T_1}|\boldsymbol{\theta})p(\boldsymbol{\theta})$$`

--

- After observing `\(\class{emphcol}{\boldsymbol{y}_{T_1+1:T_2}}\)`

--

`$$p(\boldsymbol{\theta}|\boldsymbol{y}_{1:T_2})\propto p(\class{emphcol}{\boldsymbol{y}_{T_1+1:T_2}}|\boldsymbol{\theta})\class{emphcolb}{p(\boldsymbol{\theta}|\boldsymbol{y}_{1:T_1})}$$`

--

- How to adapt this for Variational Inference?

---

# Updating Variational Bayes

- Replace `\(p\)`'s with `\(q\)`'s.

--

`$$\class{emphcolb}{q_{\boldsymbol{\lambda}_1}}(\boldsymbol{\theta}){\underset{\sim}{\propto}} p(\boldsymbol{y}_{1:T_1}|\boldsymbol{\theta})p(\boldsymbol{\theta})$$`

--

- After observing `\(\class{emphcol}{\boldsymbol{y}_{T_1+1:T_2}}\)`

--

`$${q_{\boldsymbol{\lambda}_2}}(\boldsymbol{\theta}){\underset{\sim}{\propto}} p(\class{emphcol}{\boldsymbol{y}_{T_1+1:T_2}}|\boldsymbol{\theta})\class{emphcolb}{q_{\boldsymbol{\lambda}_1}}(\boldsymbol{\theta})$$`

--

- This is Updating Variational Bayes (UVB).

--

- For UVB the likelihood involves `\(T_{i+1}-T_{i}\)` terms.
- For standard VB (SVB) it involves `\(T_{i+1}\)` terms.

---

# Why is it faster?

- UVB can start before SVB has even finished collecting the data (see the sketch on the next slide).
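---

# Sketch: the UVB target for a new block

Continuing the toy diagonal-Gaussian sketch from earlier (my own illustration, not the talk's code): at block `\(i\)` the previous variational fit `\(q_{\boldsymbol{\lambda}^*_{i-1}}\)` plays the role of the prior, so the log-joint passed to the gradient estimator only touches the new observations. `log_lik_block` is a placeholder for the model's block log-likelihood.

```python
import numpy as np

def log_q_diag(theta, lam):
    """Log density of a diagonal-Gaussian q with lam = (mu, log_sd)."""
    mu, log_sd = lam
    sd = np.exp(log_sd)
    return -0.5 * np.sum(((theta - mu) / sd) ** 2 + 2 * log_sd + np.log(2 * np.pi))

def uvb_log_joint(theta, new_block, log_lik_block, lam_prev):
    # log p(y_{T_{i-1}+1:T_i} | theta) + log q_{lambda*_{i-1}}(theta):
    # the earlier blocks enter only through the previous variational fit.
    return log_lik_block(theta, new_block) + log_q_diag(theta, lam_prev)

# Run the same SGA loop as before on this target, initialising lambda at lambda*_{i-1}.
```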
---

<!-- # Standard VB -->
<!-- - Wait to collect all data, `\(y_{1:T_2}\)`. -->
<!-- -- -->
<!-- - Choose initial `\(\boldsymbol{\lambda}^{(0)}\)` and draw `\(\boldsymbol{\theta}\sim q_{\boldsymbol{\lambda}^{(0)}}\)`. -->
<!-- -- -->
<!-- - Approximate gradient (likelihood in `\(\Psi\)` involves `\(T_2\)` terms). -->
<!-- -- -->
<!-- - Update to `\(\boldsymbol{\lambda}^{(1)}\)` and draw `\(\boldsymbol{\theta}\sim q_{\boldsymbol{\lambda}^{(1)}}\)`. -->
<!-- -- -->
<!-- - Repeat until convergence gives `\(\boldsymbol{\lambda}^*\)`. -->
<!-- -- -->
<!-- - The approximation to `\(p(\boldsymbol{\theta}|\boldsymbol{y}_{1:T_2})\)` is `\(q_{\boldsymbol{\lambda}^*}\)`. -->
<!-- --- -->
<!-- # Updating Variational Bayes -->
<!-- - Collect first block of data `\(\boldsymbol{y}_{1:T_1}\)` -->
<!-- -- -->
<!-- - Approximate `\(p(\boldsymbol{\theta}|\boldsymbol{y}_{1:T_1})\)` by `\(q_{\boldsymbol{\lambda}^*_1}\)` using VB. -->
<!-- -- -->
<!-- - Collect second block of data `\(\boldsymbol{y}_{T_1+1:T_2}\)` -->
<!-- -- -->
<!-- - Use `\(\boldsymbol{\lambda}^*_1\)` as an initial value for `\(\boldsymbol{\lambda}^{(0)}_2\)`. -->
<!-- -- -->
<!-- - Approximate gradient (likelihood in `\(\Psi\)` involves `\(T_2-T_1\)` terms) -->
<!-- -- -->
<!-- - Update to `\(\boldsymbol{\lambda}_2^{(1)}\)` and draw `\(\boldsymbol{\theta}\sim q_{\boldsymbol{\lambda}_2^{(1)}}\)`. -->
<!-- -- -->
<!-- - Repeat until convergence gives `\(\boldsymbol{\lambda}_2^*\)`. -->
<!-- -- -->
<!-- - The approximation to `\(p(\boldsymbol{\theta}|\boldsymbol{y}_{1:T_2})\)` is `\(q_{\boldsymbol{\lambda}_2^*}\)` -->
<!-- --- -->

# Updating Variational Bayes

--

.center[![](img/UVB.gif)]

---

# Standard Variational Bayes

--

.center[![](img/VB.gif)]

---

# Markov chain Monte Carlo

--

.center[![](img/MCMC.gif)]

---

# An even faster way

- At time `\(T_i\)` and step `\(m\)` of SGA:

--

- Draw values of `\(\boldsymbol{\theta}\)` from `\(q_{\boldsymbol{\lambda}^{(m)}_{i}}\)`.

--

- Compute `\(\Psi(.)\)` at each of these draws to update to `\(\boldsymbol{\lambda}^{(m+1)}_{i}\)`.

--

- We propose an algorithm that does not repeat these steps as `\(m\)` is incremented.

--

- Use `\(q_{\boldsymbol{\lambda}^*_{i-1}}\)` as an importance density for the entire SGA run at time `\(T_i\)`.

---

# Importance Sampling

Monte Carlo: Draw `\(\boldsymbol{\theta}^{[s]}\overset{iid}{\sim} q_{\boldsymbol{\lambda}^{(m)}_i}\)` and

--

`$$\nabla_{\boldsymbol{\lambda}_i^{(m)}}\approx\frac{1}{S}\sum\limits_{s=1}^S \Psi(\boldsymbol{\theta}^{[s]})$$`

--

Importance Sampling: Draw `\(\boldsymbol{\theta}^{[s]}\overset{iid}{\sim} q_{\boldsymbol{\lambda}^*_{i-1}}\)`

--

`$$\nabla_{\boldsymbol{\lambda}_i^{(m)}}\approx\frac{1}{S}\sum\limits_{s=1}^S \class{emphcolb}{\frac{q_{\boldsymbol{\lambda}^{(m)}_{i}}(\boldsymbol{\theta}^{[s]})}{q_{\boldsymbol{\lambda}^{*}_{i-1}}(\boldsymbol{\theta}^{[s]})}}\Psi(\boldsymbol{\theta}^{[s]})$$`

---

# Importance Sampling

Monte Carlo: Draw `\(\boldsymbol{\theta}^{[s]}\overset{iid}{\sim} \class{emphcol}{q_{\boldsymbol{\lambda}^{(m)}_i}}\)` and

`$$\nabla_{\boldsymbol{\lambda}_i^{(m)}}\approx\frac{1}{S}\sum\limits_{s=1}^S \Psi(\boldsymbol{\theta}^{[s]})$$`

Importance Sampling: Draw `\(\boldsymbol{\theta}^{[s]}\overset{iid}{\sim} \class{emphcol}{q_{\boldsymbol{\lambda}^*_{i-1}}}\)`

`$$\nabla_{\boldsymbol{\lambda}_i^{(m)}}\approx\frac{1}{S}\sum\limits_{s=1}^S \frac{q_{\boldsymbol{\lambda}^{(m)}_i}(\boldsymbol{\theta}^{[s]})}{q_{\boldsymbol{\lambda}^{*}_{i-1}}(\boldsymbol{\theta}^{[s]})}\Psi(\boldsymbol{\theta}^{[s]})$$`

<!-- --- -->
<!-- # Why does this work? -->
<!-- - For UVB, each time `\(\boldsymbol{\lambda_2}\)` is updated a new sample of `\(\boldsymbol{\theta}\)` must be generated. -->
<!-- -- -->
<!-- - Also `\(\Psi\)` (which involves a likelihood calculation) must be recomputed after each update of `\(\boldsymbol{\lambda_2}\)`. -->
<!-- -- -->
<!-- - For UVB with Importance sampling (UVB-IS) a sample of `\(\boldsymbol{\theta}\)` is drawn and `\(\Psi\)` computed only ONCE per block of data. -->
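---

# Sketch: reusing draws across the SGA run

A sketch of the reuse idea in the same toy diagonal-Gaussian setting as the earlier slides (my illustration, not the talk's implementation): `\(\boldsymbol{\theta}\)` is drawn and the expensive log-joint term evaluated once per block, and only the importance weights and the score are recomputed as `\(\boldsymbol{\lambda}\)` changes.

```python
import numpy as np

def uvbis_gradient(lam, lam_prev, theta, log_joint_vals):
    """Importance-weighted gradient estimate; theta was drawn once from q_{lambda*_{i-1}}."""
    mu, log_sd = lam
    sd = np.exp(log_sd)
    log_q = -0.5 * np.sum(((theta - mu) / sd) ** 2 + 2 * log_sd + np.log(2 * np.pi), axis=1)
    mu0, log_sd0 = lam_prev
    sd0 = np.exp(log_sd0)
    log_q_prev = -0.5 * np.sum(((theta - mu0) / sd0) ** 2 + 2 * log_sd0 + np.log(2 * np.pi), axis=1)
    w = np.exp(log_q - log_q_prev)                      # q_lambda / q_{lambda*_{i-1}}
    diff = log_joint_vals - log_q                       # log p(theta, y) - log q(theta)
    score_mu = (theta - mu) / sd ** 2
    score_log_sd = ((theta - mu) / sd) ** 2 - 1.0
    return (np.mean(w[:, None] * score_mu * diff[:, None], axis=0),
            np.mean(w[:, None] * score_log_sd * diff[:, None], axis=0))

# Per block: draw theta ~ q_{lambda*_{i-1}} and evaluate log_joint_vals once,
# then call uvbis_gradient repeatedly inside the SGA loop as lambda is updated.
```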
---

# UVB-IS

<img src="img/UVBIS.png" width="405" style="display: block; margin: auto;" />

---

# Compared to Sakaya and Klami

<img src="img/sayaka.png" width="340" style="display: block; margin: auto;" />

---

# Updating Variational Bayes

--

.center[![](img/UVB.gif)]

---

# Updating Variational Bayes (IS)

--

.center[![](img/UVBIS.gif)]

---

# Speed at the cost of accuracy

--

.center[![](img/Crash.gif)]

---
class: center, middle, inverse

# Evaluation

---

# Simulation 1

- Stationary AR(3) model.

--

- One-step-ahead forecasts need to be made at `\(t=100,125,150,\ldots,500\)`.

--

- Forecast accuracy is measured by the log score.

--

- Conduct 500 replications.

--

- For `\(q\)`, consider a mixture of multivariate normals with `\(K=1,2,3\)` components.

---

# Simulation 2

$$ y_{it}|k_i=j\sim N(\mu_j,\sigma^2_j) $$

- The number of time periods and the number of cross-sectional units are both 100: `\(t=1,\ldots,100\)`, `\(i=1,\ldots,100\)`.

--

- Each unit comes from one of two distributions, `\(k_i\in\left\{1,2\right\}\)` for all `\(i\)`.

--

- The aim is to classify each unit into the correct component at `\(t=10,20,30,\ldots,100\)`.

--

- Accuracy is measured by the proportion of correct classifications.

---

# Simulation 2: Approximation

- Continuous parameters `\(\boldsymbol{\theta}\)` and discrete parameters `\(k_{i}\)`.

--

- Marginalise out the discrete parameters (component indicators)

`$$p(\boldsymbol{\theta},\boldsymbol{y})=p(\boldsymbol{\theta})\prod\limits^n_{i=1}\class{emphcolb}{\sum\limits_{j=1,2}}\text{Pr}(k_i=j)\prod\limits_{t=1}^T\phi(y_{it};\mu_j,\sigma^2_j)$$`

--

- Approximate by `\(q(\boldsymbol{\theta})\)`, and use it to obtain estimates of `\(\mbox{Pr}(k_i=j|\boldsymbol{y})\)` (see the sketch on the next slide).
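---

# Sketch: classifying a unit

A sketch of the classification step above (my own illustration, not the talk's code): given draws `\(\boldsymbol{\theta}^{[s]}\sim q\)`, `\(\mbox{Pr}(k_i=j|\boldsymbol{y})\)` is estimated by averaging the conditional component probabilities over the draws. The dictionary keys `mu`, `sigma` (a standard deviation) and `pi` are assumptions of the sketch.

```python
import numpy as np
from scipy.stats import norm

def classify_unit(y_i, theta_draws):
    """Estimate Pr(k_i = j | y) for one unit from draws theta^[s] ~ q(theta)."""
    probs = np.zeros(2)
    for th in theta_draws:
        # log [ Pr(k_i = j) * prod_t phi(y_it; mu_j, sigma_j^2) ] for j = 1, 2
        log_num = np.array([np.log(th["pi"][j])
                            + norm.logpdf(y_i, th["mu"][j], th["sigma"][j]).sum()
                            for j in range(2)])
        log_num -= log_num.max()                          # stabilise before exponentiating
        probs += np.exp(log_num) / np.exp(log_num).sum()  # Pr(k_i = j | y_i, theta^[s])
    return probs / len(theta_draws)
```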
---

# Relative cumulative run time

- For SVB, the time taken to conduct inference at time `\(T_i\)`.

--

- For UVB, the sum of the times taken to conduct inference at `\(T_1,T_2,\ldots,T_i\)`.

--

- This accounting is, if anything, unfair to our methods, since SVB is only ever charged for its most recent run.

--

- All times are reported relative to SVB at time `\(T_1\)` (all methods share this as a first step).

---

# Time comparison (at `\(T_2\)`)

---

# Time comparison (at `\(T_3\)`)
---

# Simulation 1: Timing

<img src="img/AR3timing.png" width="384" style="display: block; margin: auto;" />

---

# Simulation 1: Accuracy

<img src="img/AR3cls.png" width="384" style="display: block; margin: auto;" />

---

# Simulation 2: Timing

<img src="img/mixNormTiming.png" width="384" style="display: block; margin: auto;" />

---

# Simulation 2: Accuracy

<img src="img/mixNormScore.png" width="384" style="display: block; margin: auto;" />

---
class: inverse, middle, center

# Predicting Car Position

---

# Car Position

- Predict lateral deviation from the centre of the lane, every 100 milliseconds.

--

- Train the model on `\(500\)` vehicles and predict for another `\(500\)` vehicles.

--

- Assume a model similar to Simulation 2.

--

- Dirichlet process prior on the parameters of each component.

--

- The number of components is data driven.

---

# Data preprocessing

- Geometry of the road is removed by a spline model trained on a different set of cars.

<img src="img/cars_roadStraightening.png" width="512" style="display: block; margin: auto;" />

---

# Clusters

<img src="img/DPPgroups_2.png" width="640" style="display: block; margin: auto;" /><img src="img/DPPoverall_2.png" width="640" style="display: block; margin: auto;" />

---

# Forecast accuracy

- The benchmark model assumes all cars are independent, so it cannot use the information learned about clusters.

--

- Then consider the DPP model with three different inferential methods:

--

  + MFVB (assumes independence in `\(q\)`)
  + SVB with a 3-component normal mixture
  + UVB-IS with a 3-component normal mixture

---

# Results

<img src="img/cumulative log scores.png" width="640" style="display: block; margin: auto;" />

---

# Conclusions

- The independent model performs poorly. We need to *borrow strength* from other cars.

--

- The mean field approximation also performs poorly. We need dependence in the posterior.

--

- The results for SVB and UVB-IS sit almost on top of one another.

---

# Conclusions

- Evidence suggests that updating can be used with VB.

--

- The gains in computational time can be substantial.

--

- The trade-off in terms of accuracy is generally small.

--

- The exact nature of this trade-off will depend on the model and the purposes of the analysis.