Monday, May 16, 2016

Adaptive Exploration on Plane Task

I got the Plane E2C task to work: the adaptive policy finds areas of high E2C loss more quickly than the random one does.

The left column shows the dataset samples (fixed size of 3000 points); the right column shows the marginal E2C loss (mean over all possible U for a given position).

Random policy:
Adaptive Policy:

I repeated this experiment on a different environment map, one with a narrow corridor.

Random policy:


Adaptive policy:


Notice how, in both cases, the adaptive scheme's dataset covers the full training space sooner than the random policy's does.


Friday, May 13, 2016

Got the 1D simplified model to work

Adaptive exploration is now working on the 1D task. I simplified the codebase; I think I had introduced several bugs by over-engineering the experimental setup to make it "generalize" across a variety of simulation scenarios.

Environment
$x$ is a single scalar representing the robot's position on the interval $(0,1)$. The robot moves left and right according to applied controls $u \in (-1,1)$, and its motion is damped by a potential function shown in the first column of the figure below: there is no damping at $x=0.64$, and the heaviest damping (least movement) occurs at $x=0.2$.

Here, I simulated the robot for $T=1500$ steps. The second column depicts the sampled states $x$ and the controls $u$ applied at those states. The third column depicts the same trajectory, with $x$ plotted as a function of simulation step.
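For concreteness, here is a minimal NumPy sketch of this kind of damped 1D environment. This is not the actual simulator: the damping profile and step scaling are made up to match the description above (no damping near $x=0.64$, heaviest damping near $x=0.2$).

    import numpy as np

    def damping(x):
        # Hypothetical damping profile: ~1 (no damping) near x = 0.64,
        # smallest (least movement) near x = 0.2.
        return 1.0 - 0.9 * np.exp(-((x - 0.2) ** 2) / 0.02)

    def step(x, u, dt=0.01):
        # Control u in (-1, 1) pushes the robot left/right,
        # attenuated by the position-dependent damping.
        x_next = x + dt * damping(x) * u
        return float(np.clip(x_next, 1e-3, 1.0 - 1e-3))

    # Simulate T = 1500 steps under a uniform random policy.
    x, traj = 0.5, []
    for _ in range(1500):
        u = np.random.uniform(-1.0, 1.0)
        x = step(x, u)
        traj.append(x)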
Experimental setup

The dataset $D$ holds 1500 samples. Under both the random and adaptive schemes, I let the robot wander for 1500 steps to pre-populate the dataset ("burn-in"), then let it wander for 30 more "episodes". In each episode, 20% of the existing dataset is replaced by new data sampled by the policy.

This means that under both the random and adaptive policies, the robots are allowed to wander the same number of total steps (1500 + 30*150).
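A rough sketch of that bookkeeping is below; `step` is the environment step from the sketch above, `train_e2c` and `policy` are placeholders for the models, and the per-episode step count follows the 30*150 figure.

    import numpy as np

    BURN_IN, EPISODES, STEPS_PER_EP = 1500, 30, 150

    # Burn-in: pre-populate the dataset with a uniform random policy.
    x, dataset = 0.5, []
    for _ in range(BURN_IN):
        u = np.random.uniform(-1.0, 1.0)
        x_next = step(x, u)
        dataset.append((x, u, x_next))
        x = x_next

    for ep in range(EPISODES):
        train_e2c(dataset)            # outer loop: fit E2C on the current dataset
        new = []
        for _ in range(STEPS_PER_EP):
            u = policy(x)             # random: uniform; adaptive: sample from learned P(u|x)
            x_next = step(x, u)
            new.append((x, u, x_next))
            x = x_next
        # Swap the new samples in for an equal number of randomly chosen old ones.
        keep = np.random.choice(len(dataset), len(dataset) - len(new), replace=False)
        dataset = [dataset[i] for i in keep] + new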

Uniform Random Policy:
Explanation of tableau:

  • Rows: episodes 1-5.
  • Columns:
      • Dataset: distribution of points used to train E2C and, if applicable, the exploration policy.
      • Learned P(U|X): the policy distribution after training on this cycle's dataset.
      • Reconstruction loss heatmap (X on x-axis, U on y-axis).
      • Prediction loss heatmap (X on x-axis, U on y-axis).


I had previously plotted the L2 loss used to train E2C, and the reconstructions looked pretty good. However, since $x \in (0,1)$ and the predicted motions are quite small, the L2 loss appears very dim when visualized.





Adaptive Policy:

The adaptive model converges slightly faster, and is able to move the green points away from the existing blue/green points. At a given episode, if the E2C loss at the blue/green regions is already 0 (i.e., driven down by the outer minimization), then the updated policy will propose controls $u$ that strictly increase the loss should the robot re-encounter those same $x$ values.

Green points are newly sampled points (drawn via the policy) that were swapped into the dataset for that cycle.


Something else that worked was resetting the exploration policy at the start of each episode. The intuition is that we want the policy to avoid samples it just proposed in the previous episode (i.e., the previous group of simulation steps); if we bootstrap off the previous policy's weights, we would probably pick similar points again in spite of the optimization. So we re-optimize the policy from an untrained state each episode, and only let the weights of the E2C model carry over.
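In TensorFlow this reset amounts to re-running the initializer for just the policy's variables. A minimal TF1-style sketch, where the "e2c" and "policy" variable scopes and the toy variables are stand-ins for the real models:

    import tensorflow as tf

    # Toy stand-ins: the real E2C model and policy are built elsewhere.
    with tf.variable_scope("e2c"):
        e2c_w = tf.get_variable("w", shape=[10, 10])
    with tf.variable_scope("policy"):
        pi_w = tf.get_variable("w", shape=[10, 2])

    # An op that re-initializes only the policy's variables.
    policy_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope="policy")
    reset_policy = tf.variables_initializer(policy_vars)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        # ... train E2C and the policy for one episode ...
        sess.run(reset_policy)  # policy starts fresh; E2C weights carry over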

However, it does seem a bit wasteful to re-learn the policy from scratch every episode. Need to think more about this.

I also realized that it's not possible to maximize $L_{\text{E2C}}$ directly with respect to $u$: we cannot compute the E2C loss without access to the transition function. The same limitation applies to sampling-based approaches where we pick a bunch of random $u$ and see which one incurs the highest E2C loss; again, we can't compute the E2C loss for an unobserved sample.




Thursday, May 12, 2016

Adaptive AE toy task

E2C is too complicated to deal with, so I simplified the model and am just using a simple autoencoder to do prediction and reconstruction. No dynamics or anything fancy.

I modified the potential function, and deliberately kept the datasets small.

Random policy



Adaptive policy



Note: the adaptive policy uses random noise, but it seems to perform poorly when sampling normal $e \sim N(0,1)$ noise. Further investigation needed.


Couple notes to self:

  • When there are issues with "need to feed placeholder" errors, it's likely because you forgot to set var_list in the optimizer's minimize() call (see the sketch below this list).
  • These simple models train faster on the CPU than on the GPU.
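A TF1-style sketch of the var_list fix; the two toy models here are stand-ins, and the point is just that each optimizer is told explicitly which variables to differentiate and update:

    import tensorflow as tf

    x_ph = tf.placeholder(tf.float32, [None, 1])   # state batch
    u_ph = tf.placeholder(tf.float32, [None, 1])   # control batch (only the E2C path uses it)

    with tf.variable_scope("e2c"):
        w_e = tf.get_variable("w", [2, 1])
        e2c_loss = tf.reduce_mean(tf.square(tf.matmul(tf.concat([x_ph, u_ph], 1), w_e) - x_ph))

    with tf.variable_scope("policy"):
        w_p = tf.get_variable("w", [1, 1])
        policy_loss = tf.reduce_mean(tf.square(tf.matmul(x_ph, w_p)))  # toy stand-in

    # Restrict each optimizer to its own model's variables.
    e2c_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="e2c")
    pi_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="policy")
    train_e2c = tf.train.AdamOptimizer(1e-3).minimize(e2c_loss, var_list=e2c_vars)
    train_pi = tf.train.AdamOptimizer(1e-3).minimize(policy_loss, var_list=pi_vars)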

Wednesday, May 11, 2016

Adversarial Exploration troubles...

I've been running into issues getting "adversarial exploration" to work. I'm not certain whether it's a programming bug (something wrong with my TensorFlow implementation or the way Python modules are imported into each other) or a theoretical one (the sampler getting trapped in a region, with the local gradients preventing it from escaping). In any case, it's not working as well as I thought it would, even for the 1D case.

The idea of this "adversarial exploration policy" is to alternate between exploring the state-action space and training E2C, using the existing samples and the current E2C model to decide where to go next. Concretely, observation-action mini-batches are importance-sampled according to the distribution induced by the E2C model's objective function. This prioritizes experiences with high loss, which should lead to faster convergence of model learning with respect to agent exploration time.
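A minimal NumPy sketch of that prioritization, assuming a hypothetical `per_sample_loss` helper that evaluates the current E2C loss of a stored transition; a softmax over per-sample losses is just one choice of sampling distribution:

    import numpy as np

    def sample_minibatch(dataset, per_sample_loss, batch_size=64, temperature=1.0):
        # Importance-sample a mini-batch: transitions with higher current
        # E2C loss are more likely to be drawn (softmax over losses).
        losses = np.array([per_sample_loss(d) for d in dataset])
        p = np.exp((losses - losses.max()) / temperature)
        p /= p.sum()
        idx = np.random.choice(len(dataset), size=batch_size, p=p, replace=True)
        return [dataset[i] for i in idx]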

Let $X = (x_i,u_i,x_{i+1})$ be a random variable representing a single tuple in the state-action-next_state space.

For each new episode $c$ where the current state observation is $x_{c-1}$, we solve the minimax problem

$$
\min_{\phi,\psi,\theta} \max_{\pi}\mathbb{E}_{\mathcal{D}_c}{[L_{\text{E2C}}|\mathcal{D}_{c-1}]}
$$
where
$$
\mathcal{D}_c = \mathcal{D}_{c-1} \cup \{(x_{c-1}, u_{c-1}, x_c)\}, \quad x_c = f(x_{c-1}, u_{c-1}) \\
u_{c-1} \sim P_\pi(\cdot \mid x_{c-1})
$$

What dataset should $\mathcal{D}_c$ converge to? The densities shouldn't matter, as long as E2C reconstruction is robust for whatever planning task we want to do (of course, performance degrades if we spend time exploring rare regions where the robot is unlikely to end up).

Right now I am optimizing $\pi$ over samples drawn from $\mathcal{D}_c$, as I should be for the inner loop of the minimax. I could instead use a fast, policy-free search: evaluate the E2C model on a Latin hypercube / orthogonal sample of the action space and choose the action with the highest "predicted" E2C loss. That would amount to greedily chasing regions of high loss at every step, rather than implicitly learning a "value function" across the entire observation space.
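A sketch of that search for the 1D action space, assuming a hypothetical `predicted_e2c_loss(x, u)` helper; as noted in the May 13 entry above, the true E2C loss of an unobserved transition isn't computable, so such a helper would have to score candidates from the model's own predicted next state:

    import numpy as np

    def choose_action(x, predicted_e2c_loss, n_candidates=64):
        # Stratified sampling over u in (-1, 1) (Latin hypercube reduces to
        # this in 1D): one candidate per stratum, pick the highest-scoring one.
        edges = np.linspace(-1.0, 1.0, n_candidates + 1)
        candidates = np.random.uniform(edges[:-1], edges[1:])
        scores = np.array([predicted_e2c_loss(x, u) for u in candidates])
        return candidates[np.argmax(scores)]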

Fixed a weird bug where separating my variational autoencoder out into its own module resulted in a computational graph that still trained, but *very poorly* (the loss was not decreasing). I still don't know why this happened; perhaps a Python object passed by reference got modified somewhere?

Here are some tableaus I constructed. For some reason the reconstruction line is flat: all the X values are being mapped to the same value.


Random sampling.

A stochastic policy seemed to be pretty important. The adaptive results are somewhat promising, though significantly worse than I'd hoped.




Here's the reconstruction on the plane task:



In order to demonstrate that adaptive E2C works, I would need to try an even simpler VAE or AE reconstruction of 1D inputs.