Wednesday, May 11, 2016

Adversarial Exploration troubles...

I've been running into issues getting "adversarial exploration" to work. I'm not sure whether it's a programming bug (something wrong with my TensorFlow implementation, or with how the Python modules import each other) or a theoretical one (the sampler gets trapped in a region and the local gradients prevent it from escaping). In any case, it's not working as well as I thought it would, even for the 1D case.

The idea of this "adversarial exploration policy" is to alternate between exploring the state-action space and training E2C, using the existing samples and the current E2C model to decide where to go next. Concretely, we importance-sample observation-action mini-batches in proportion to the E2C model's per-sample loss. This prioritizes experiences the model currently handles poorly, which should lead to faster convergence of model learning with respect to agent exploration time.
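As a concrete sketch of that prioritization step (this is not my actual TensorFlow code; the temperature parameter and the importance-weight correction are assumptions borrowed from prioritized replay, not something I've settled on), the minibatch sampler might look like this:

```python
import numpy as np

def sample_prioritized_minibatch(per_sample_loss, batch_size, temperature=1.0):
    """Draw minibatch indices with probability proportional to loss**(1/temperature).

    per_sample_loss: the current E2C loss evaluated on every (x_t, u_t, x_{t+1})
    tuple in the dataset (recomputed periodically rather than every step).
    """
    weights = np.power(np.maximum(per_sample_loss, 1e-8), 1.0 / temperature)
    probs = weights / weights.sum()
    idx = np.random.choice(len(per_sample_loss), size=batch_size, p=probs)
    # Importance weights to (roughly) correct for the non-uniform sampling,
    # normalized by the max so they only ever scale gradients down.
    importance = 1.0 / (len(per_sample_loss) * probs[idx])
    importance /= importance.max()
    return idx, importance
```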

Let $X = (x_i,u_i,x_{i+1})$ be a random variable representing a single tuple in the state-action-next_state space.

For each new episode $c$ where the current state observation is $x_{c-1}$, we solve the minimax problem

$$
\min_{\phi,\psi,\theta} \max_{\pi}\mathbb{E}_{\mathcal{D}_c}{[L_{\text{E2C}}|\mathcal{D}_{c-1}]}
$$
where
$$
\mathcal{D}_c = \mathcal{D}_{c-1} \cup \{(x_{c-1}, u_{c-1}, f(x_c))\}\\
u_{c-1} \sim P_\pi(\cdot | x_{c-1})
$$

What dataset should $\mathcal{D}_c$ converge to? The densities shouldn't matter, as long as E2C reconstruction is robust for whatever planning task we want to do (of course, performance degrades if we spend time exploring rare regions where the robot is unlikely to end up).
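For reference, here is roughly how the alternation implied by the minimax looks in code. The `env`, `e2c`, and `policy` objects and their methods are hypothetical placeholders for my actual classes; only the loop structure is the point here.

```python
# Sketch of the explore/train alternation; env, e2c, and policy are placeholders.
def adversarial_exploration(env, e2c, policy, num_steps):
    dataset = []                      # D_c, grown one tuple per step
    x_prev = env.reset()              # x_0
    for c in range(1, num_steps + 1):
        # Inner max: the policy proposes an action it expects to incur high
        # E2C loss from the current observation.
        u = policy.sample_action(x_prev)      # u_{c-1} ~ P_pi(. | x_{c-1})
        x_next = env.step(u)                  # observe x_c
        dataset.append((x_prev, u, x_next))   # D_c = D_{c-1} U {(x_{c-1}, u_{c-1}, x_c)}
        # Outer min: refit the E2C parameters (phi, psi, theta) on D_c.
        e2c.train(dataset)
        # Refresh the policy against the updated model so it keeps seeking
        # regions where the model's loss is still high.
        policy.update(e2c, dataset)
        x_prev = x_next
    return dataset, e2c
```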

Right now I am optimizing $\pi$ over samples drawn from $\mathcal{D}_c$, as I should be for the inner loop of the minimax. Alternatively, I could use a fast, policy-free search: evaluate the E2C model at Latin hypercube / orthogonal samples over the action space and choose the action with the highest "predicted" E2C loss. That would amount to greedily chasing regions of high loss at every step, rather than implicitly learning a "value function" across the entire observation space.
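A rough sketch of what that policy-free search could look like: the Latin hypercube sampler is a standard construction, and `predicted_loss_fn` is a hypothetical hook that evaluates the current E2C model's loss for a candidate action from the current observation.

```python
import numpy as np

def latin_hypercube(num_samples, low, high):
    """Latin hypercube samples over a box-shaped action space."""
    low, high = np.atleast_1d(low), np.atleast_1d(high)
    dim = low.shape[0]
    # One stratified sample per bin along each dimension, shuffled independently.
    u = (np.random.rand(num_samples, dim) + np.arange(num_samples)[:, None]) / num_samples
    for d in range(dim):
        np.random.shuffle(u[:, d])
    return low + u * (high - low)

def greedy_high_loss_action(x, predicted_loss_fn, low, high, num_samples=64):
    """Pick the candidate action whose predicted E2C loss is largest."""
    candidates = latin_hypercube(num_samples, low, high)
    losses = np.array([predicted_loss_fn(x, u) for u in candidates])
    return candidates[np.argmax(losses)]
```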

Fixed a weird bug where moving my variational autoencoder into a separate module resulted in a computational graph that still trained, but *very poorly* (the loss was not decreasing). I still do not know why this happened: perhaps a Python variable passed by reference got modified somewhere?

Here are some tableaus I constructed. For some reason the reconstruction line is flat: all the x values are being mapped to the same value.


[Figure: random sampling]

A stochastic policy seemed to be pretty important. The adaptive results are somewhat promising, though significantly worse than I'd hoped.




Here's the reconstruction on the plane task:



In order to demonstrate that adaptive E2C works, I would need to try an even simpler VAE or AE reconstruction of a 1D input.
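The kind of minimal sanity check I have in mind would look something like this, written against the Keras API for brevity (the layer sizes and latent dimension are arbitrary, and this is a plain AE rather than the actual E2C/VAE code):

```python
import numpy as np
import tensorflow as tf

# Synthetic 1-D inputs, just to check that reconstructions don't collapse.
x = np.random.uniform(-1.0, 1.0, size=(1024, 1)).astype("float32")

encoder = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(1,)),
    tf.keras.layers.Dense(2),                      # 2-D latent code
])
decoder = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(2,)),
    tf.keras.layers.Dense(1),                      # reconstruct the scalar input
])
autoencoder = tf.keras.Sequential([encoder, decoder])
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(x, x, epochs=50, batch_size=64, verbose=0)

# A flat reconstruction here (every x mapped to one value) would point to a
# training/optimization bug rather than a problem with adversarial exploration.
print(np.c_[x[:5], autoencoder.predict(x[:5], verbose=0)])
```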
