Environment
$x$ is a single scalar representing the robot's position on the interval $(0,1)$. The robot moves left and right according to applied controls $u \in (-1,1)$, and is damped by a potential function shown in the first column of the figure below: there is no damping at $x=0.64$, and the highest damping (least movement) occurs at $x=0.2$.
Here, I simulated the robot for $T=1500$ steps. The second column shows the sampled states $x$ and the control $u$ applied at each state. The third column shows the same trajectory, with $x$ plotted against simulation step.
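To make the setup concrete, here is a minimal sketch of such an environment. The post does not give the actual potential, so the `damping` function below is an assumption chosen only to match the two stated facts (zero damping at $x=0.64$, maximum damping at $x=0.2$). The names `step`, `simulate`, `step_size`, `x0` and `eps` are also illustrative, not taken from the original experiment.

```python
import math
import random

X_FREE = 0.64   # from the post: no damping here
X_STUCK = 0.2   # from the post: highest damping here

def damping(x):
    """Assumed damping in [0, 1]; the post does not give the real potential."""
    phase = math.pi * (x - X_FREE) / (2.0 * (X_FREE - X_STUCK))
    return math.sin(phase) ** 2

def step(x, u, step_size=0.05, eps=1e-6):
    """Move by u, scaled down by the local damping, and stay inside (0, 1)."""
    x_next = x + step_size * (1.0 - damping(x)) * u
    return min(max(x_next, eps), 1.0 - eps)

def simulate(T=1500, x0=0.5, seed=0):
    """Roll out T steps under uniform random controls; returns (x, u) pairs."""
    rng = random.Random(seed)
    x, samples = x0, []
    for _ in range(T):
        u = rng.uniform(-1.0, 1.0)
        samples.append((x, u))
        x = step(x, u)
    return samples
```

Plotting the `(x, u)` pairs as a scatter, and `x` against the step index, would give rough analogues of the second and third columns.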
Experimental setup
The dataset $D$ consists of 1500 values. In both the simple (uniform random) and adaptive schemes, I let the robot wander for 1500 steps to pre-populate the dataset ("burn-in"), then let it wander for 30 more "episodes". In each episode, 20% of the existing dataset is replaced by new data sampled by the policy.
This means that under both the random and adaptive policies, the robots are allowed to wander the same number of total steps (1500 + 30*150).
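The burn-in and replacement schedule can be sketched roughly as follows. The post does not say which existing entries are overwritten, so replacing uniformly random slots is an assumption here. `refresh_dataset`, `run_schedule` and the `collect` callback are hypothetical names, and the replacement fraction is left as a parameter.

```python
import random

def refresh_dataset(dataset, new_samples, rng):
    """Overwrite len(new_samples) randomly chosen slots of dataset in place."""
    slots = rng.sample(range(len(dataset)), len(new_samples))
    for slot, sample in zip(slots, new_samples):
        dataset[slot] = sample
    return slots

def run_schedule(collect, burn_in, episodes, replace_fraction, seed=0):
    """Burn in, then swap a fixed fraction of the dataset on every episode.

    collect(n) is a caller-supplied function returning n new samples.
    Returns the final dataset and the total number of simulated steps.
    """
    rng = random.Random(seed)
    dataset = collect(burn_in)           # pre-populate the dataset ("burn-in")
    total_steps = burn_in
    n_new = int(round(replace_fraction * len(dataset)))
    for _ in range(episodes):
        refresh_dataset(dataset, collect(n_new), rng)
        total_steps += n_new
    return dataset, total_steps
```

Because the dataset size stays fixed, both schemes see the same number of environment steps as long as they share `burn_in`, `episodes` and `replace_fraction`.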
Uniform Random Policy:
Explanation of tableau:
rows: episodes 1-5.
columns:
Dataset - distribution of points used to train E2C and if applicable, the exploration policy.
Learned P(U|X) - the policy distribution after training on this cycle's dataset
Reconstruction Loss heatmap (X on x-axis, U on y-axis)
Prediction Loss heatmap (X on x-axis, U on y-axis)
I had previously drawn the L2 loss used to learn E2C, and the reconstructions looked pretty good. However, since $x \in (0,1)$ and the predicted motions are actually quite small, the L2 loss is very dim when visualized.
Adaptive Policy:
The adaptive model converges slightly faster, and is able to move the green points away from existing blue/green points. At a given episode, if the E2C loss in the blue/green regions is already 0 (i.e. via minimization in the outer loop), then our updated policy will propose samples $u_0$ that strictly increase the loss should it re-encounter those same $x$ values.
Green points are the points newly sampled by the policy and swapped into the dataset for that cycle.
Something else that worked was to reset the exploration policy on each episode. The intuition here is that we want the policy to avoid the samples it just proposed in the previous episode (group of simulation steps). If we bootstrap off our previous policy, we'd probably pick similar points again in spite of optimization. So, we re-optimize the policy from an untrained state on each episode, and just let the weights for the E2C model carry over.
However, it does seem a bit wasteful to have to re-learn everything from scratch on each episode. Need to think more about this.
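As a sketch of the control flow only, the per-episode reset described above might look like the loop below. Every callable (`make_policy`, `fit_model`, `fit_policy`, `collect`, `swap_in`) is a hypothetical placeholder supplied by the caller, and the ordering of the steps within an episode is my assumption rather than something the post specifies.

```python
def run_adaptive(model, dataset, episodes,
                 make_policy, fit_model, fit_policy, collect, swap_in):
    """Hypothetical outer loop: the model persists, the policy is rebuilt each episode."""
    policies = []
    for _ in range(episodes):
        fit_model(model, dataset)            # E2C weights carry over between episodes
        policy = make_policy()               # fresh, untrained exploration policy
        fit_policy(policy, model, dataset)   # re-optimize the policy from scratch
        swap_in(dataset, collect(policy))    # new samples replace part of the dataset
        policies.append(policy)
    return policies
```

The only state shared across episodes is `model` and `dataset`; nothing learned by one policy is handed to the next, which is exactly the part that feels wasteful.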
I also realized that it's not possible to maximize $L_{e2c}$ directly with respect to $u$: we cannot compute the E2C loss without having access to the transition function. This also applies to sampling-based approaches where we pick a bunch of random $u$ and see which one incurs the highest E2C loss. This is impossible because, again, we can't compute the E2C loss for an unobserved sample.
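One way to write this down, using notation that is my own assumption ($x_t$ for the current state, $f$ for the true transition function), is to note that the loss is evaluated on a full transition tuple, so the direct objective would be

$$u^{*} = \arg\max_{u} \; L_{e2c}\big(x_t,\, u,\, x_{t+1}\big), \qquad x_{t+1} = f(x_t, u).$$

Evaluating the right-hand side for a candidate $u$ requires $f(x_t, u)$, which is only known for the $(x_t, u)$ pairs the robot has actually executed.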