My Ramblings

Exponential Smoothing, Note 10/25/2020

Over the course of implementing a pipeline for Bayesian inference using the Stein variational method and experimenting with all of its latest extensions, I ran into the need for a simple way to track my particles as they flow along the posterior landscape.

Take me straight to the animation!

A digression on Bayesian inference.

Briefly, Stein-variational inference is a nonparametric way to sample an intractable distribution. In that sense it competes directly with the Markov-chain Monte Carlo family of methods. For any arbitrary distribution \(p(z),\) we only need to know the gradients of its logarithm, \(\nabla \log p(z),\) in order to approximate it with our particles. Such techniques are central to approximate Bayesian inference, in which we look to sample a distribution with an unknown normalizing constant (known in some circles as the partition function \(Z(\theta),\) since it depends on the distribution's parameters.)

Recall the famous formula, which stems from the very foundations of probability theory: \[ p(\theta|x) = \frac{p(x|\theta)p(\theta)}{p(x)} \] One could argue that the entire scientific method is encoded in this simple equation. How do we translate the observation of some \( x \) into knowledge about a hypothetical model of the world, denoted as \(\theta \)? We pack all the things we already know, along with philosophical considerations like Occam's razor, into the prior distribution \(p(\theta).\) Our other ingredient is the probability density of any piece of data under a given model, \(p(x|\theta).\) Bayes' rule flips this quantity on its head to yield a distribution over the models that could suit a particular observation: \(p(\theta|x).\)
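As a toy, hand-rolled illustration of Bayes' rule in action (the coin, the grid, and the numbers here are my own invented example, not anything from the pipeline above), consider inferring a coin's bias from a handful of flips by brute-force grid approximation:

```python
# Toy Bayes' rule: infer a coin's bias theta from observed flips,
# via a grid approximation over a discretized uniform prior.
thetas = [i / 100 for i in range(1, 100)]     # candidate models
prior = [1 / len(thetas)] * len(thetas)       # uninformative prior p(theta)

def likelihood(theta, heads, tails):
    # p(x | theta) for a sequence with `heads` heads and `tails` tails
    return theta ** heads * (1 - theta) ** tails

heads, tails = 7, 3
unnorm = [likelihood(t, heads, tails) * p for t, p in zip(thetas, prior)]
evidence = sum(unnorm)                        # p(x), the denominator
posterior = [u / evidence for u in unnorm]    # p(theta | x), sums to one

# With a flat prior, the posterior mode sits at the empirical frequency 0.7
map_theta = thetas[max(range(len(thetas)), key=lambda i: posterior[i])]
```

The denominator here is exactly the troublesome normalizing constant: easy on a one-dimensional grid, hopeless in high dimensions, which is the motivation for everything that follows.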

Note that a lot of people do Bayesian analysis without even realizing it. Whenever you take an objective function and treat it like a log-likelihood (in the parlance of physicists, its negative would be an energy function,) and employ any of the numerous methods to sample your model's parameters with respect to it, you are implicitly sampling the posterior. Your priors are simply uninformative over whatever domain you have chosen, but that doesn't necessarily give you pleasant guarantees about the estimators that come out of it. It's a tricky subject.

By the law of total probability, we can aggregate the likelihood over all possible models belonging to the family parametrized by \(\theta\): \[ p(\theta|x) = \frac{p(x|\theta)p(\theta)}{\int p(x|\theta)p(\theta)\,\mathrm{d}\theta} \] Whether a particular choice of model family (e.g. linear regression, polynomial interpolation, deep neural networks) is the correct one is a matter of model uncertainty, a scientific question that sometimes verges on philosophy. This posterior \(p(\theta|x)\) may be analytic, but typically we either have to estimate the denominator empirically or come up with clever ways to circumvent it. This is where the log-posterior's (or log-likelihood's) gradients come in handy: the integral that shows up as a normalizing constant disappears in this quantity! Hence the value of Stein inference and the like lies in its sole reliance on \(\nabla_\theta \log p(\theta|x)\) in order to perform its optimization.
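To see the cancellation concretely, here is a tiny sketch (the quadratic log-density and the pretend value of \(Z\) are placeholders of my own choosing) showing that shifting a log-density by any constant \(\log Z\) leaves its gradient untouched:

```python
import math

# The score, grad log p, is blind to the normalizing constant: adding
# log Z (a constant) to the log-density does not change its gradient.
def log_unnormalized(theta):
    return -0.5 * theta ** 2        # true density: exp(-theta^2 / 2) / Z

def score(logp, theta, eps=1e-5):
    # central finite-difference approximation of d/dtheta log p(theta)
    return (logp(theta + eps) - logp(theta - eps)) / (2 * eps)

shifted = lambda th: log_unnormalized(th) - math.log(123.0)  # pretend Z = 123
s1 = score(log_unnormalized, 1.5)
s2 = score(shifted, 1.5)
# s1 and s2 agree (both roughly -1.5) even though the densities differ by Z
```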

Stein variational inference.

Equipped with the above Bayesian primer, we can think about finding a \(\theta\) that maximizes the posterior for the given sample \(x.\) A successful foray would yield a maximum a posteriori estimate. But how do we reckon with uncertainty? The posterior contains all kinds of information to be extracted in order to inform our state of epistemic knowledge. To characterize the distribution more completely, we turn to inference rather than estimation. I might elaborate on this subject in a future note; for now, we recognize that the theory underpinning Stein variational inference allows us to effectively sample a posterior via particles. These particles are represented as vectors in the parameter space. We solve local optimization problems (employing gradients and Hessians of the log-density) to inch them closer to regions of high probability. Nearby particles have their update directions blended together, and are also repelled from each other a bit in order to spread out across the distribution. It's a balancing act: the repulsion is commensurate with the changes in probability level that are deemed worthwhile. The degree of smoothing acts similarly; empirically, it seems to help particles escape small valleys in the energy landscape.
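To make the blended-attraction-plus-repulsion update concrete, here is a minimal sketch of an SVGD-style step. The RBF kernel with a fixed bandwidth and the standard-normal target are stand-in assumptions of mine, not the actual pipeline:

```python
import numpy as np

def rbf_kernel(X, h):
    # pairwise kernel matrix K and its gradient w.r.t. the first argument
    diff = X[:, None, :] - X[None, :, :]           # (n, n, d)
    K = np.exp(-np.sum(diff ** 2, axis=-1) / h)    # (n, n)
    gradK = -2.0 / h * diff * K[:, :, None]        # (n, n, d)
    return K, gradK

def svgd_step(X, grad_logp, h=1.0, stepsize=0.5):
    # phi(x_i) = mean_j [ k(x_j, x_i) grad log p(x_j) + grad_{x_j} k(x_j, x_i) ]
    # first term: kernel-smoothed attraction; second: pairwise repulsion
    n = X.shape[0]
    K, gradK = rbf_kernel(X, h)
    phi = (K @ grad_logp(X) + gradK.sum(axis=0)) / n
    return X + stepsize * phi

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2)) + 5.0       # particles start far from the mode
for _ in range(300):
    X = svgd_step(X, lambda Z: -Z)       # standard normal: grad log p(z) = -z
# the cloud drifts toward the origin and spreads out over the Gaussian
```

Note the repulsive term is exactly the gradient of the kernel, so the same bandwidth \(h\) that controls smoothing also sets how strongly nearby particles push apart.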

All of these rather subjective decisions are addressed compactly through our choice of kernel function, which defines the proximity (between zero and one) between particles. Formally, it constructs a reproducing kernel Hilbert space (RKHS) in which the direction that maximizes the Stein discrepancy between the particles' empirical distribution and the real posterior can be found analytically. Each optimization step then nudges the particles along that direction to locally minimize the discrepancy, as if in a min-max adversarial game. If the family of functions determined by the RKHS is expressive enough, the framework asymptotically guarantees that our particles approach the real distribution.

I took a rather convoluted path to explain the main purpose of this note. Essentially, I needed an efficient way to monitor these kernels during training, since a great deal of tuning is often necessary.

The beauty of exponential moving averages.

Whenever you are faced with a stream of numbers, and you would like to keep a running tally of the “average” smoothed over a general time horizon, there is an efficient and straightforward technique. Its special quality is that you do not have to keep track of a whole window of observations. Otherwise, you would have to maintain something like an array holding the most recent values, shifting the whole thing every time you place a new measurement at the head and pop the tail. Ring buffers alleviate much of this trouble, but they can still be a pain. Instead, I would like to provide some intuition for the exponential-smoothing approach.

Various texts implement slightly differing versions of it because financial traders, general forecasters, and other field practitioners have adapted it to their own liking. The premise is simple: keep track of a single sum \(y_t\), and every time you receive a new datum \( x_t\) you update the sum with a new weighted average between the new measurement and the old average. \[ y_t = \alpha x_t + (1-\alpha) y_{t-1}\] Clearly, there is no hard cutoff in this rolling average: all data is remembered forever, but the contribution of each point to the current tally diminishes by \( 0<1-\alpha<1 \) each time a new measurement is recorded. Its window decays exponentially over the past, since the weight of any point that is \( t \) units behind is effectively \((1-\alpha)^t\).
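In code the whole technique is a one-liner. A minimal sketch (note the convention I use here puts \(\alpha\) on the incoming datum, which is the one that matches the Adam-style bias correction; swap \(\alpha \leftrightarrow 1-\alpha\) for the other convention found in some texts):

```python
def ema_update(y_prev, x_new, alpha):
    # blend the new measurement with the running average
    return alpha * x_new + (1.0 - alpha) * y_prev

# A constant stream converges to the constant: the initial zero's
# weight after t updates is (1 - alpha)^t, which decays to nothing.
y = 0.0
for _ in range(1000):
    y = ema_update(y, 10.0, alpha=0.1)
# y is now indistinguishable from 10.0
```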

With data \(x_1, x_2, x_3\dots x_N,\) the estimator at the very beginning \(y_0\) must be initialized to something. Often, that is zero. The Adam optimizer of deep-learning fame, relying on rolling estimates of the gradients' mean and variance, takes advantage of the fact that we know precisely how much weight is placed on that initial zero, i.e. our bias. We can then correct for it: \[ \hat x_t = \frac{y_t}{1-(1-\alpha)^t}.\]
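The correction just divides out the residual weight of the zero initialization. A sketch, under the same convention where \(\alpha\) weights the new datum:

```python
def ema_bias_corrected(xs, alpha):
    # Running EMA with Adam-style correction for the zero initialization.
    y, out = 0.0, []
    for t, x in enumerate(xs, start=1):
        y = alpha * x + (1.0 - alpha) * y
        out.append(y / (1.0 - (1.0 - alpha) ** t))   # divide out the bias
    return out

# With a constant stream the corrected estimate is exact from step one:
corrected = ema_bias_corrected([5.0, 5.0, 5.0], alpha=0.01)
# corrected == [5.0, 5.0, 5.0] up to floating-point error
```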

Here's the trick when it comes to Stein inference: when you are sampling the parameters of a model with many components, you probably compute individual kernels component-wise (each with its own bandwidth that must be tuned,) and then multiply them together. Therefore we need to be pretty sensitive to kernel values really close to zero, as any one of them will “destroy” the whole proximity value. We can take the exponential moving average of their logarithms and evaluate the exponential of \(\hat x_t\); this translates to a rolling, exponentially weighted geometric mean of the proximities.
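A sketch of that rolling geometric mean; `geo_ema` is a hypothetical helper name of mine and the input proximities are made up, but the log-then-exponentiate structure is the idea described above:

```python
import math

def geo_ema(values, alpha=0.1):
    # EMA of the logarithms, then exponentiate: an exponentially
    # weighted geometric mean, safe for proximities very close to zero.
    y = 0.0
    for t, v in enumerate(values, start=1):
        y = alpha * math.log(v) + (1.0 - alpha) * y
    debias = 1.0 - (1.0 - alpha) ** t    # correct for the zero start
    return math.exp(y / debias)

g = geo_ema([1e-8, 1e-6, 1e-7])
# g lands between the smallest and largest proximities, near their
# geometric mean, without ever multiplying tiny numbers together
```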

Usually \(\alpha\) is set to a pretty small value. To garner an idea about the real effect of different choices, I present below a simple visual. Click anywhere in the illustration to extract an item from the belt of incoming values on the right, and place it into one of the slots in the black box that represents the rolling average. Its size as a fraction of the box's represents the item's weight. Each existing entry is proportionally diminished with the introduction of a new item. After a few insertions, the exponential weighting pattern will become clear as a consequence of the cumulative averaging. The box's outline color will reflect the smoothed value of all of its entries. The blank space occupying the lower portion of the box will gradually decay to nothing, at a pace of \(t\mapsto(1-\alpha)^t.\) You might notice that it takes a while to completely fill up, since in fact it never really does!

The relevant meaning of \(\alpha\), which you can explore below, lies in adjusting the rolling average's forgetfulness. With higher values, a greater importance is placed on the latest observations, but to the detriment of smoothness. Enjoy! This was also a fun excuse to build my own PID controller for the animations.

\(\alpha=0.01\)     Click below for a demonstration.