Tuesday, April 19, 2016

Fixed E2C Implementation Mistake

*facepalm

Just realized that I made an error in my E2C implementation. The transition dynamics in the latent space $Z$ are sampled as follows:
$$
z_t \sim N(\mu_t,\Sigma_t) \\
\hat{z}_{t+1} \sim N(A_t\mu_t+B_tu_t+o_t,\,C_t)\\
C_t=A_t\Sigma_tA_t^T
$$

Previously, I had come up with a transformation of a sample $i \sim N(0,\Sigma_t)$ into the desired sample $\hat{z}_{t+1}$. While that sample is indeed drawn from the right distribution, closer inspection shows that this is totally the wrong approach from a physical standpoint:

In a locally-linear dynamical system, the next state is determined via $z_{t+1} = A_tz_t+B_tu_t+o_t$, where $A_t$ is the Jacobian of the dynamics w.r.t. the autonomous (latent) state and $B_t$ is the Jacobian w.r.t. the applied controls $u_t$.

I don't think there should be anything stochastic about this transition process itself. $\hat{z}_{t+1} \sim N(A_t\mu_t+B_tu_t+o_t,\,C_t)$ is a random variable not because the transition dynamics are probabilistic, but because $z_{t+1}$ is a deterministic transformation of the random variable $z_t$. In other words, conditioned on $z_t$, we want the variance of $||z_{t+1}-z_t||$ to be 0.
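
To spell out why the deterministic transform still has the right marginal (writing $L_t$ for a Cholesky factor of $\Sigma_t$ and $\epsilon$ for the base noise draw, my notation):
$$
z_t = \mu_t + L_t\epsilon,\quad \epsilon \sim N(0,I),\quad L_tL_t^T=\Sigma_t \\
\hat{z}_{t+1} = A_tz_t+B_tu_t+o_t = (A_t\mu_t+B_tu_t+o_t) + A_tL_t\epsilon \\
\mathrm{Cov}(\hat{z}_{t+1}) = A_tL_tL_t^TA_t^T = A_t\Sigma_tA_t^T = C_t
$$
So re-using the single noise draw $\epsilon$ already yields $\hat{z}_{t+1} \sim N(A_t\mu_t+B_tu_t+o_t,\,C_t)$, with no second sampling step needed.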

If we re-sample $\hat{z}_{t+1}$ with a fresh draw, we are adding "extra noise into the transition" from the second sampling of $N(0,1)$. The marginal variance of $z_{t+1}$ remains the same, but we've increased the variance of $||z_{t+1}-z_t||$, so the variance of our gradients is going to be greater. Another way to look at this is that we are "de-correlating" our sample of $z_t$ from our sample of $z_{t+1}$ when they should in fact be tightly coupled. Re-sampling amounts to drawing $\hat{z}_{t+1}$ as if each sample of $z_t$ were exactly equal to the mean $\mu_t$.
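
A quick numpy sketch of this point (not my actual model code; the matrices, dimensions, and sample count are made-up toy values):

```python
# Toy check of the two sampling schemes (not the actual E2C code);
# mu_t, L_t, A_t, B_t, u_t, o_t are made-up example values.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000                                     # number of Monte Carlo samples

mu_t = np.array([0.5, -1.0])
L_t  = np.array([[0.8, 0.0], [0.3, 0.5]])       # Cholesky factor: Sigma_t = L_t L_t^T
A_t  = np.array([[1.0, 0.1], [-0.2, 0.9]])
B_t  = np.eye(2)
u_t  = np.array([0.2, 0.0])
o_t  = np.array([0.05, -0.05])

eps = rng.standard_normal((n, 2))
z_t = mu_t + eps @ L_t.T                        # z_t ~ N(mu_t, Sigma_t)

mean_next = A_t @ mu_t + B_t @ u_t + o_t
C_chol = A_t @ L_t                              # C_t = (A_t L_t)(A_t L_t)^T = A_t Sigma_t A_t^T

# (a) re-sampled: fresh noise -> right marginal, but decoupled from z_t
z_next_resampled = mean_next + rng.standard_normal((n, 2)) @ C_chol.T
# (b) re-parameterized: deterministic transform of the existing z_t samples
z_next_reparam = z_t @ A_t.T + B_t @ u_t + o_t

for name, z_next in [("re-sampled", z_next_resampled),
                     ("re-parameterized", z_next_reparam)]:
    print(name,
          "Var(z_next):", z_next.var(axis=0).round(3),
          "Var(z_next - z_t):", (z_next - z_t).var(axis=0).round(3))
```

Both schemes give (up to Monte Carlo error) the same marginal variance for $\hat{z}_{t+1}$, but the re-sampled version has a much larger variance for $z_{t+1}-z_t$.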

The fact that $z_{t+1}$ should be a re-parameterized version of $z_t$ rather than re-sampled seems obvious in hindsight, but I just realized that this is precisely the mechanism by which the re-parameterization trick reduces variance. If we have a set of random variables $x_1, x_2, \ldots, x_N$ that we need to sample, then instead of drawing $N$ independent noise samples, we draw a single $N(0,1)$ sample and re-use it to compute every random variable we need.
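
A toy scalar version of the same idea (my notation), with every $x_k$ computed from one shared noise draw $\epsilon$:
$$
x_k = \mu_k + \sigma_k\epsilon,\quad \epsilon \sim N(0,1) \implies x_k \sim N(\mu_k,\sigma_k^2) \text{ for each } k, \\
\mathrm{Var}(x_j - x_k) = (\sigma_j-\sigma_k)^2 \;\text{ (shared } \epsilon\text{)} \qquad \text{vs.} \qquad \sigma_j^2+\sigma_k^2 \;\text{ (independent draws)}
$$
Every marginal is unchanged, but anything that compares the samples, and hence the gradients flowing through those comparisons, has much lower variance when the noise is shared.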

Fixing this bug makes the latent space images nicer. The result below is from a model trained for 2e5 iterations, with max step size = 3.




Still not perfect, but this was without any of the multistep / KL / large step-size tricks I had previously tried.
