Here's a list of stuff I want to try:

**Stuff I'm pretty sure will work:**
- Use spatial transformers to localize the agent before passing it to the encoder, so that we only encode state variables relevant to the robot pose (and not the entire system). Of course,
- In the paper, the authors pass in the previous frame in addition to the current frame. But what if we instead passed in the difference between the current and last frame (i.e. $x_t-x_{t-1}$)? The sparsity of the difference should make it easier to infer velocity.
- Use ResNets for computing the transition network. For most reasonable systems, I hypothesize that it's easier to sample a perturbation $\Delta z$ and then compute $z_{t+1} = z_t + \Delta z$ rather than sampling $z_{t+1}$ unconditionally. This is just like using an additive canvas in DRAW.
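
A minimal NumPy sketch of the last two ideas. The function names are made up for illustration, the difference frame is assumed to be stacked with the current frame, and `delta_mean` / `delta_logvar` stand in for whatever the transition network would output:

```python
import numpy as np

def difference_input(x_t, x_prev):
    """Stack the current frame with the (sparse) frame difference x_t - x_{t-1}."""
    return np.concatenate([x_t, x_t - x_prev], axis=-1)

def residual_transition(z_t, delta_mean, delta_logvar, rng):
    """Sample a perturbation delta_z ~ N(mean, exp(logvar)) and add it to z_t."""
    eps = rng.standard_normal(z_t.shape)
    delta_z = delta_mean + np.exp(0.5 * delta_logvar) * eps
    return z_t + delta_z

rng = np.random.default_rng(0)

# Toy 16-pixel "frames": a single pixel changed between the two time steps.
x_prev = np.zeros(16)
x_t = np.zeros(16)
x_t[3] = 1.0
enc_in = difference_input(x_t, x_prev)

# With a near-zero variance and zero mean, the next latent stays close to z_t.
z_t = rng.standard_normal(4)
z_next = residual_transition(z_t, np.zeros(4), np.full(4, -20.0), rng)
```

The point of the residual form is that "nothing changes" corresponds to the network outputting roughly zero, which seems like an easier default to learn than reproducing $z_t$ from scratch.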

**Stuff I hope will work:**
E2C uses the following pipeline:

- Sample trajectory starting states $x_0$.
- The exploration policy $\pi_\rho$ samples random control trajectories, and we feed these into the simulation with $x_0$ to generate action-state-transition tuples $(x_t,u_t,x_{t+1})$.
- Use the E2C framework to learn the latent space & transition model.
- Use the model + optimal trajectory algorithms to do planning.

For all $x_t$ achievable in simulation, we want our dataset to have samples $(x_t,u_t,x_{t+1})$ corresponding to all possible controls $u_t$ in the action space applied to that state.

Right now, my "exploration policy" doesn't use any information from the ongoing simulation. In fact, I just uniformly sample over the action space without any knowledge of the current observation.
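
For reference, the current observation-blind scheme looks roughly like the sketch below. `toy_step` is a hypothetical stand-in for the real simulator (a point moving in the unit square, with the action dimension equal to the state dimension), not the actual E2C environment:

```python
import numpy as np

def toy_step(x, u):
    """Hypothetical simulator: a point pushed around inside the unit square."""
    return np.clip(x + 0.1 * u, 0.0, 1.0)

def collect_uniform(step_fn, x0, n_steps, u_low, u_high, rng):
    """Roll out one trajectory with actions drawn uniformly, ignoring observations."""
    data, x = [], x0
    for _ in range(n_steps):
        u = rng.uniform(u_low, u_high, size=x0.shape)  # no dependence on x
        x_next = step_fn(x, u)
        data.append((x, u, x_next))
        x = x_next
    return data

rng = np.random.default_rng(0)
dataset = collect_uniform(toy_step, np.array([0.5, 0.5]), 100, -1.0, 1.0, rng)
```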

Intuitively, we'd instead like to get the robot to *interactively* explore the state-action space as fully as possible, particularly regions where the reconstruction or prediction loss is high. During training, we want to draw more samples $x_t$ from the state space (I use $x$ as a proxy for "state" even though it actually means "observation" in E2C) that correspond to high reconstruction losses. Similarly, we want to draw more samples $(x_t,u_t,x_{t+1})$ that result in high prediction loss, so we can hopefully obtain more useful gradients.

This is akin to adaptive importance-sampling regions of high noise in path tracing!
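
One way to make the path-tracing analogy concrete is to sample stored tuples with probability proportional to their last recorded loss, and divide by the sampling probability so the average is not biased. This is only a sketch under that assumption; `sample_by_loss` and the `eps` floor are illustrative names, not part of E2C:

```python
import numpy as np

def sample_by_loss(losses, batch_size, rng, eps=1e-6):
    """Draw dataset indices with probability proportional to per-sample loss."""
    p = np.asarray(losses, dtype=float) + eps  # eps keeps zero-loss samples reachable
    p = p / p.sum()
    idx = rng.choice(len(p), size=batch_size, replace=True, p=p)
    # Importance weights 1 / (N * p_i) correct for the non-uniform sampling.
    weights = 1.0 / (len(p) * p[idx])
    return idx, weights

rng = np.random.default_rng(0)
losses = np.array([0.01, 0.01, 0.01, 5.0])  # one badly-predicted sample
idx, w = sample_by_loss(losses, 1000, rng)
```

Here nearly all draws land on the high-loss sample, and its importance weight is correspondingly small.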

We could compute $\nabla_\rho x_t$, and perhaps use that to query what states the robot should sample more of, but in general we cannot compute what constitutes a valid state, short of solving a trajectory optimization problem.

Instead, I propose an "exploration policy" that adaptively samples $u_t$ to encourage exploration of the space. There are some possible approaches:

- We can do gradient ascent on the reconstruction loss $L_\text{bound}$ w.r.t. $\rho$, i.e. $\rho := \rho + \nabla_\rho L_\text{bound}$, so the exploration policy proposes $u_t$'s that tend to increase the E2C loss. The "plain English" explanation of this is that we want the robot to re-try actions that it failed to predict well for some previous state.
- Of course, the loss is guaranteed to be increased only if we start from the same state $x_t$, but our new starting state has changed to $x_{t+1}$. I don't think there's much we can do about this, since in real life robot training we probably have to manually initialize starting states and let most of the data come from a single trajectory. But maybe $x_{t+1}$ is sufficiently close to $x_t$ that this doesn't matter...
- It's easy to see how this can fail to converge if the robot is constantly chasing bad actions, overfitting, then un-learning all its previous understanding of the state space, so the composition of each training minibatch needs to contain some past samples.
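
Since $x_{t+1}$ comes out of the simulator, $L_\text{bound}$ is probably not directly differentiable w.r.t. $\rho$. One standard workaround (an assumption here, not something taken from E2C) is the score-function estimator, with a step size $\alpha$ added to the update:

$$
\nabla_\rho \, \mathbb{E}_{u_t \sim \pi_\rho}\left[L_\text{bound}\right] = \mathbb{E}_{u_t \sim \pi_\rho}\left[L_\text{bound} \, \nabla_\rho \log \pi_\rho(u_t)\right] \\
\rho := \rho + \alpha \, L_\text{bound} \, \nabla_\rho \log \pi_\rho(u_t)
$$

This only needs the loss value for each sampled $u_t$, not a gradient through the simulator.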

At time $t$, we have our updated $\rho$ from the exploration policy and apply it immediately to the robot to obtain $(x_t,u_t,x_{t+1})$.

**Adversarial Exploration Policy Gradient**
Here's a better idea: at time $t$, we use $x_t$ to compute $u_t$, i.e. $\pi_\rho$ is an adversarial policy that attempts to maximize $L$ given $x_t$.

$$
u_t \sim \pi_\rho(\cdot|x_t) \\
\rho := \rho + \nabla_\rho L_\text{bound}
$$

Over $M$ simulation time steps, we accumulate $M$ samples of $(x_t,u_t,x_{t+1})$, which we then use to minibatch-update $\pi_\rho$ jointly with the E2C model (we'll need two copies of the net, one single-batch and one minibatch, for passing $x_{t:t+M}$ through).
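
A rough sketch of what the policy half of that update could look like, assuming a linear-Gaussian policy $u_t \sim \mathcal{N}(W x_t, \sigma^2 I)$ and a score-function gradient estimate. `toy_loss` is a hypothetical stand-in for the E2C loss of a sampled tuple, and all names here are illustrative:

```python
import numpy as np

def sample_action(W, x, sigma, rng):
    """Linear-Gaussian policy: u ~ N(W x, sigma^2 I)."""
    mean = W @ x
    return mean + sigma * rng.standard_normal(mean.shape)

def reinforce_ascent(W, xs, us, losses, sigma, lr):
    """One gradient *ascent* step on the expected loss over an M-sample batch."""
    losses = np.asarray(losses, dtype=float)
    advantages = losses - losses.mean()  # batch-mean baseline to reduce variance
    grad = np.zeros_like(W)
    for x, u, adv in zip(xs, us, advantages):
        # d/dW log N(u; W x, sigma^2 I) = outer(u - W x, x) / sigma^2
        grad += adv * np.outer(u - W @ x, x) / sigma**2
    return W + lr * grad / len(xs)

def toy_loss(x, u):
    """Hypothetical stand-in for the E2C loss: larger actions give higher loss."""
    return float(u[0])

rng = np.random.default_rng(0)
W, sigma, M = np.zeros((1, 1)), 0.5, 32
x0 = np.array([1.0])
for _ in range(50):
    xs = [x0] * M  # toy case: the same observation at every step
    us = [sample_action(W, xi, sigma, rng) for xi in xs]
    losses = [toy_loss(xi, ui) for xi, ui in zip(xs, us)]
    W = reinforce_ascent(W, xs, us, losses, sigma, lr=0.1)
```

In this toy the policy mean drifts toward larger actions, i.e. toward whatever the stand-in loss penalizes most. In the real setup the losses would come from the E2C model, which is being trained at the same time.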

To prevent overfitting to the (highly) temporally-correlated M-trajectory, we let a fraction of the samples be drawn from an experience replay database. In this case, the adversarial policy also gets exposed to experience replay. I'm working through the paper Prioritized Experience Replay, which has a lot of interesting ideas that I need to think about more on this front.
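
As a baseline before anything prioritized, a minibatch could simply mix fresh trajectory samples with uniformly drawn replay samples. This is a sketch with made-up names (`mixed_minibatch`, `replay_frac`); it does uniform replay only, not the prioritized scheme from the paper:

```python
import numpy as np
from collections import deque

def pick(items, n, rng):
    """Pick n distinct items uniformly at random (empty list if n == 0)."""
    if n == 0:
        return []
    chosen = rng.choice(len(items), size=n, replace=False)
    return [items[int(i)] for i in chosen]

def mixed_minibatch(fresh, replay, batch_size, replay_frac, rng):
    """Build a minibatch from the current M-trajectory plus past experience."""
    n_replay = min(int(round(batch_size * replay_frac)), len(replay))
    n_fresh = min(batch_size - n_replay, len(fresh))
    return pick(fresh, n_fresh, rng) + pick(replay, n_replay, rng)

rng = np.random.default_rng(0)
fresh = [("fresh", t) for t in range(8)]                      # latest M samples
replay = deque([("old", t) for t in range(100)], maxlen=1000)  # bounded buffer
batch = mixed_minibatch(fresh, replay, batch_size=8, replay_frac=0.5, rng=rng)
replay.extend(fresh)  # the new samples become past experience afterwards
```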

I think this is pretty elegant because we're using model-free control (exploration policy) to act as an importance sampling scheme for learning a model-based RL system (E2C), and the whole thing is coupled via the trendy notion of "dueling architectures".

**Exploration without E2C information**
A simpler option is that we don't involve E2C at all in the exploration policy, but instead let the exploration policy try to figure out the space on its own.

In the case of the plane control task, the entire state space is computable, so sampling is relatively easy. In general though, the topology of the state space is unknown, so we'd need to devise some kind of universal exploration function to properly characterize the space.

Not sure if there are existing approaches...

**Stuff that didn't seem to work right away, but I'm still interested in pursuing:**
- Multistep E2C - I'd like to measure the total prediction loss over a trajectory under a single-step E2C model trained for $2N$ iterations, and then compare that to the same loss after a multistep model (initialized with E2C weights trained for $N$ iters) has been trained for an additional $N$ iterations.
- We can even combine multi-step E2C with sequential generative models between each time step. Here's where the very deep capabilities of ResNets would be powerful...
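
For the single-step vs. multistep comparison, the quantity to measure could be an open-loop rollout error like the one sketched below. `encode`, `transition` and `decode` are placeholders for the trained model's components, and squared error is an assumption standing in for the actual E2C prediction loss:

```python
import numpy as np

def rollout_prediction_loss(encode, transition, decode, xs, us):
    """Sum of squared errors between open-loop predictions and observed frames."""
    z = encode(xs[0])
    total = 0.0
    for t, u in enumerate(us):
        z = transition(z, u)  # stay in latent space, never re-encode x_t
        total += float(np.sum((decode(z) - xs[t + 1]) ** 2))
    return total

# Toy check: true dynamics are x_{t+1} = x_t + u_t with identity encoder/decoder.
us = [np.ones(2)] * 3
xs = [np.zeros(2)]
for u in us:
    xs.append(xs[-1] + u)

identity = lambda v: v
perfect = rollout_prediction_loss(identity, lambda z, u: z + u, identity, xs, us)
biased = rollout_prediction_loss(identity, lambda z, u: z + u + 0.1, identity, xs, us)
```

A small per-step bias compounds over the rollout, which is exactly the effect a multistep objective is supposed to penalize.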

Is there software for organizing exploratory research ideas such as
these? I have the basic E2C model, and I want to know what happens when I
extend it in various ways (add attention, use residual generation,
multistep), and I'm also interested in what happens when I combine these
all together. It's also important from an experimental standpoint to
vary experimental parameters incrementally, otherwise I can't know for
sure what a change does to the model. I know Domino Data Labs does
something like this, but I think they're more focused on data science /
analysis than model-building.