## Monday, May 16, 2016

I got the Plane E2C task to work: the adaptive policy is more adept at quickly finding areas of high E2C loss.

Left column is database samples (fixed size of 3000 points), right column is marginal E2C loss (mean over all possible U for a given position).

Random policy:

I repeated this experiment on a different environment map, one with a narrow corridor.

Random policy:

Notice how in both cases, the adaptive scheme's dataset covers the full training space sooner than the random policy.

## Friday, May 13, 2016

### Got the 1D simplified model to work

Adaptive exploration is working now, on the 1D task. I simplified the codebase (I think I introduced several bugs by trying to over-engineer the experimental setup and make it "generalize" for a variety of different simulation scenarios).

Environment
$x$ is a single scalar representing the robot's position on the interval $(0,1)$. It moves left and right according to applied controls $u \in (-1,1)$, and is damped by a potential function shown in the first column of the figure below: there is no damping at $x=0.64$ and the highest damping (least movement) occurs at $x=.2$.

Here, I simulated the robot for $T=1500$ steps. The second column depicts samples $x$ and the control we applied at that state $u$. The third column depicts this same trajectory as $x$ as a function of simulation steps.
Experimental setup

The dataset $D$ consists of 1500 values. In both simple/adaptive schemes, I let the robot wander around for 1500 steps to pre-populate the dataset ("burn-in"), then let it wander around for 30 more "episodes". In each episode 20% of the existing dataset is replaced by new data sampled by the policy.

This means that under both the random and adaptive policies, the robots are allowed to wander the same number of total steps (1500 + 30*150).

Uniform Random Policy:
Explanation of tableau:

rows: episodes 1-5.
columns:
Dataset - distribution of points used to train E2C and if applicable, the exploration policy.
Learned P(U|X) - the policy distribution after training on this cycle's dataset
Reconstruction Loss heatmap (X on x-axis, U on y-axis)
Prediction Loss heatmap (X on x-axis, U on y-axis)

I had previously drawn the L2 loss used to learn E2C, and the reconstructions looked pretty good. However, since $x \in (0,1)$ and the predicted motions are actually quite small, the L2 loss is very dim when visualized.

The adaptive model converges slightly faster, and is able to move the green points away from existing blue/green points. At a given episode, if the E2C loss at the blue/green regions is already 0 (i.e. via minimization of outer loop), then our updated policy will propose samples $u0$ that strictly increase the loss should it re-encounter those same $x$ values.

Green points correspond to newly-sampled points (via policy) and swapped into the dataset for that given cycle.

Something else that worked was to reset the exploration policy on each episode. The intuition here is that we want the policy to avoid samples that it just proposed in the previous episode (group of simulation steps), because we bootstrap off our previous policy, we'd probably pick similar points again in spite of optimization. So, we re-optimize the policy from an un-trained state on each episode, and just let the weights for the E2C model carry over.

However, it does seem a bit wasteful to have to re-learn everything from scratch on each episode. Need to think more about this.

I also realized that it's not possible to maximize $L_{e2c}$ directly with respect to $u$ - we cannot compute E2C loss without having access to the transition function. This also applies to sampling-based approaches where we pick a bunch of random $u$ and see which one incurs the highest E2C loss - this is impossible b.c. again, we can't compute E2C loss for an un-observed sample.

## Thursday, May 12, 2016

E2C is too complicated to deal with, so I simplified the model and am just using a simple autoencoder to do prediction and reconstruction. No dynamics or anything fancy.

Modified the potential function, with datasets deliberately occupying fewer samples.

Random policy

Note: the adaptive policy uses random noise, but it seems to perform poorly when sampleNormal $e \sim N(0,1)$ noise. Further investigation needed.

Couple notes to self:

• When there are issues with "need to feed placeholders", it's likely b.c. you forgot to set var_list in optimizer.
• These simple models train faster on CPU than GPU.

## Wednesday, May 11, 2016

I've been running into issues getting "adversarial exploration" to work. I'm not certain if it's a programming bug (something wrong with my tensorflow implementation or the way Python modules are imported into each other) or a theoretical one (getting trapped in a sampling region and the local gradients preventing it from escaping). In any case, it's not working as well as I thought it would, even for the 1D case.

The idea of this "adversarial exploration policy" is to alternate between exploring the state-action space and training E2C, using our existing samples and the current E2C model to decide where to go. The basic idea is to importance-sample observation-action mini-batches according to the distribution of the E2C model's objective function. This prioritizes experiences to favor training samples with high loss, leading to faster convergence of model learning with respect to agent exploration time.

Let $X = (x_i,u_i,x_{i+1})$ be a random variable representing a single tuple in the state-action-next_state space.

For each new episode $c$ where the current state observation is $x_{c-1}$, we solve the minimax problem

$$\min_{\phi,\psi,\theta} \max_{\pi}\mathbb{E}_{\mathcal{D}_c}{[L_{\text{E2C}}|\mathcal{D}_{c-1}]}$$
where
$$\mathcal{D}_c = \mathcal{D}_{c-1} \cup \{(x_{c-1}, u_{c-1}, f(x_c))\}\\ u_{c-1} \sim P_\pi(\cdot | x_{c-1})$$

What dataset should $\mathcal{D}_c$ converge to? The densities shouldn't matter, as long as E2C reconstruction is robust for whatever planning task we want to do (of course, performance degrades if we spend time exploring rare regions where the robot is unlikely to end up).

Right now I am optimizing $\pi$ over samples drawn from $\mathcal{D}_c$, as I should be for optimizing the inner loop of minimax. I could instead take a fast algorithm that just evaluates the E2C model for latin hypercube / orthogonal sampling over the action space, and just choose the action that results in the highest "predicted" E2C loss. This would be more akin to a policy-free search that simply chooses regions of high loss every step, rather than trying to learn a "value function" across the entire observation space implicitly.

Fixed a weird bug where separating out my Variational autoencoder into a separate module resulted in a computational graph that still trained but *very poorly* (loss was not decreasing). I still do not know why this happened: perhaps a python variable passed as a reference got modified somewhere?

Here are some tableau's I constructed. For some reason the reconstruction line is flat - all the X values are being mapped to the same values.

random sampling.

Stochastic policy seemed to be pretty important, the adaptive results are somewhat promising, though significantly worse than I'd have hoped.

Here's the reconstruction on the plane task

In order to demonstrate that adaptive E2C works, I would need to try a even simpler VAE or AE reconstruction of 1D input.

## Thursday, April 21, 2016

### Fixed gRPC compilation

After struggling to get gRPC working between C++/Python for about a week, I figured it out:

- in QT, I can add the helloworld cpp example directory (from gRPC) into the include path, but in order for this to work I also need to add the relevant headers I want to use into the directory.

I have C++ talking to Python now, time to serialize images and send them over.

## Wednesday, April 20, 2016

### some experimental results

Re-ran some experiments from yesterday using longer max step sizes and multistep E2C. I also changed the sampling scheme to pick trajectories and times without replacement (but it doesn't have much of an effect for this robot). Here are the latent space reconstructions:

Single-step E2C, max_step=1

Single-step E2C, max_step=10

Multi-step E2C, max_step=1
I tried re-running the multistep training initialized to the single-step max_step=1 weights, but the gradients immediately blew up.

## Tuesday, April 19, 2016

### Idea Dump

Here's a list of stuff I want to try:
Stuff I'm pretty sure will work:
• Use spatial transformers to localize the agent before passing it to the encoder, so that we only encode state variables relevant to the robot pose (and not the entire system). Of course,
• In the paper, the authors pass in the previous frame in addition to current frame. But what if instead we passed in the difference between the current and last frame (i.e. $x_t-x_{t-1}$)? Seems like the sparsity will make it easier to infer velocity.
• Use ResNets for computing the transition network. For most reasonable systems, I hypothesize that it's easier to sample a perturbation $\Delta z$ and then compute $z_{t+1} = z_t + \Delta z$ rather than sampling $z_{t+1}$ unconditionally. This is just like using an additive canvas in DRAW.
Stuff I hope will work:

E2C uses the following pipeline:
1. Sample trajectory starting states $x_0$.
2. The exploration policy $\pi_\rho$ samples random control trajectories, and we feed these into the simulation with $x_0$ to generate action-state-transition tuples $(x_t,u_t,x_{t+1})$.
3. Use the E2C framework to learn the latent space & transition model.
4. Use the model + optimal trajectory algorithms to do planning.
For all $x_t$ achievable in simulation, we want our dataset to have samples $(x_t,u_t,x_{t+1})$ corresponding to all possible controls $u_t$ in the action space applied to that state.

Right now, my "exploration policy" doesn't use any information from the ongoing simulation. In fact, I just uniformly sample over the action space without any knowledge of the current observation.

Intuitively, we'd instead like to get the robot to interactively explore the state-action space as fully as possible, particularly regions where the reconstruction or prediction loss is high. During training, we want to draw more samples $x_t$ from the state space (I use $x$ as a proxy for "state" even though it actually means "observation" in E2C) that correspond to high reconstruction losses. Similarly, we want to draw more samples $(x_t,u_t,x_{t+1})$ that result in high prediction loss, so we can hopefully obtain more useful gradients.

This is akin to adaptive importance-sampling regions of high noise in path tracing!

We could compute $\nabla_\rho x_t$, and perhaps use that to query what states the robot should sample more of, but in general we cannot compute what constitutes a valid state, short of solving a trajectory optimization algorithm.

Instead, I propose an "exploration policy" that adaptively samples $u_t$ to encourage exploration of the space. There are some possible approaches:
• We can do gradient ascent on the reconstruction loss $L_\text{bound}$ with w.r.t. $\rho$, i.e. $\rho := \rho + \nabla_\rho L_\text{bound}$, so the exploration policy proposes $u_t$'s that tend to increase E2C loss. The "plain English" explanation of this is that we want the robot to re-try actions that it failed to predict well for some previous state.
• Of course, the loss is guaranteed to be increased only if we start from the same state $x_t$, but our new starting state has changed to $x_{t+1}$. I don't think there's much we can do about this, since in real life robot training we probably have to manually initialize starting states and let most of the data come from a single trajectory. But maybe $x_{t+1}$ is sufficiently close to $x_t$ that this doesn't matter...
• It's easy to see how this can fail to converge if the robot is constantly chasing bad actions, overfitting, then un-learning all its previous understanding of the state space, so the composition of each training minibatch needs to contain some past samples.
At time $t$, we have our updated $\rho$ from the exploration policy and apply it immediately to the robot to obtain $(x_t,u_t,x_{t+1}$.

Here's a better idea: at time $t$, we use $x_t$ to compute $u_t$, i.e. $\pi_\rho$ is an adversarial policy that attempts to maximize $L$ given $x_t$.

$$u_t \sim \pi_\rho(\cdot|x_t) \\ \rho := \rho + \nabla_\rho L_\text{bound}$$

Over $M$ simulation time steps, we accumulate $M$ samples of $(x_t,u_t,x_{t+1}$, which we then use to minibatch-update the $\pi_rho$ jointly with the E2C model (we'll need two copies of the net, one with single-batch and minibatch for passing $x_{t:t+M}$ through).

To prevent overfitting to the (highly) temporally-correlated M-trajectory, we let a fraction of the samples be drawn from an experience replay database. In this case, the adversarial policy also gets exposed to experience replay. I'm working through the paper Prioritized Experience Replay, which has a lot of interesting ideas that I need to think about more on this front.

I think this is pretty elegant because we're using model-free control (exploration policy) to act as an importance sampling scheme for learning a model-based RL system (E2C), and the whole thing is coupled via the trendy notion of "dueling architectures".

Exploration without E2C information

A simpler option is that we don't involve E2C at all in the exploration policy, but instead let the exploration policy try to figure out the space on its own.

In the case of the plane control task, the entire state space is computable, so sampling is relatively easy. In general though, the topology of the state space is unknown, so we'd need to devise some kind of universal exploration function to properly characterize the space.

Not sure if there are existing approaches...

Stuff that didn't seem to work right away, but I'm still interested in pursuing:
• Multistep E2C - I'd like to measure the total prediction loss over a trajectory under a single-step E2C model trained for $2N$ iterations, and then compare  that to the same loss after a multistep model (initialized with E2C weights trained for $N$ iters) has been trained for an additional $N$ iterations.
• We can even combine multi-step E2C with sequential generative models between each time step. Here's where very deep capabilities of ResNets would be powerful...
Are there software for organizing exploratory research ideas such as these? I have the basic E2C model, and I want to know what happens when I extend it in various ways (add attention, use residual generation, multistep), and I'm also interested in what happens when I combine these all together. It's also important from an experimental standpoint to vary experimental parameters incrementally, otherwise I can't know for sure what a change does to the model. I know Domino Data Labs does something like this, but I think they're more focused on data science / analysis than model-building.