Thursday, April 21, 2016

Fixed gRPC compilation

After struggling for about a week to get gRPC working between C++ and Python, I figured it out:

- I need to override the Protobuf headers installed in /usr/local/include; otherwise I get the "this protobuf header was compiled with a more recent version of protoc" error.
- In Qt, I can add the helloworld C++ example directory (from gRPC) to the include path, but for this to work I also need to copy the relevant headers I want to use into that directory.

I have C++ talking to Python now, time to serialize images and send them over.
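Since the next step is shipping images from the C++ simulator to Python, here's a rough sketch of the Python receiving side. The message and field names (ImageFrame-style `pixel_data`, `height`, `width`, `channels`) are made up for illustration, not my actual proto schema, and the gRPC server wiring itself depends on which Python gRPC API version is installed.

```python
import numpy as np

def frame_to_array(msg):
    """Decode a hypothetical ImageFrame proto (raw uint8 pixels + shape fields)
    into an HxWxC numpy array on the Python/TensorFlow side."""
    flat = np.frombuffer(msg.pixel_data, dtype=np.uint8)
    return flat.reshape((msg.height, msg.width, msg.channels))

def array_to_bytes(img):
    """Inverse direction: serialize a numpy image into raw bytes for the proto
    field. The C++ side would fill the same field from its own pixel buffer."""
    return np.ascontiguousarray(img, dtype=np.uint8).tobytes()
```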

Wednesday, April 20, 2016

some experimental results

Re-ran some experiments from yesterday using longer max step sizes and multistep E2C. I also changed the sampling scheme to pick trajectories and times without replacement (but it doesn't have much of an effect for this robot). Here are the latent space reconstructions:

Single-step E2C, max_step=1


Single-step E2C, max_step=10

Multi-step E2C, max_step=1
I tried re-running the multistep training initialized to the single-step max_step=1 weights, but the gradients immediately blew up.




Tuesday, April 19, 2016

Idea Dump

Here's a list of stuff I want to try:
Stuff I'm pretty sure will work:
  • Use spatial transformers to localize the agent before passing it to the encoder, so that we only encode state variables relevant to the robot pose (and not the entire system).
  • In the paper, the authors pass in the previous frame in addition to the current frame. But what if we instead passed in the difference between the current and previous frame (i.e. $x_t-x_{t-1}$)? It seems like the sparsity would make it easier to infer velocity.
  • Use ResNets for the transition network. For most reasonable systems, I hypothesize that it's easier to sample a perturbation $\Delta z$ and then compute $z_{t+1} = z_t + \Delta z$ than to sample $z_{t+1}$ unconditionally. This is just like using an additive canvas in DRAW (rough sketch after this list).
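Here's a minimal sketch of what that residual transition could look like. The layer sizes, variable names, and the old-style `tf.concat(dim, values)` argument order are placeholders/assumptions, not my actual network:

```python
import tensorflow as tf

def residual_transition(z_t, u_t, z_dim, u_dim, hidden=128):
    # Sketch: parameterize the *change* in latent state and add it back,
    # ResNet / DRAW-canvas style, rather than regressing z_{t+1} from scratch.
    W1 = tf.Variable(tf.random_normal([z_dim + u_dim, hidden], stddev=0.1))
    b1 = tf.Variable(tf.zeros([hidden]))
    W2 = tf.Variable(tf.random_normal([hidden, z_dim], stddev=0.1))
    b2 = tf.Variable(tf.zeros([z_dim]))
    h = tf.nn.relu(tf.matmul(tf.concat(1, [z_t, u_t]), W1) + b1)
    dz = tf.matmul(h, W2) + b2       # sampled/predicted perturbation delta_z
    return z_t + dz                  # z_{t+1} = z_t + delta_z
```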
Stuff I hope will work:

E2C uses the following pipeline:
  1. Sample trajectory starting states $x_0$.
  2. The exploration policy $\pi_\rho$ samples random control trajectories, and we feed these into the simulation with $x_0$ to generate state-action-next-state tuples $(x_t,u_t,x_{t+1})$. 
  3. Use the E2C framework to learn the latent space & transition model.
  4. Use the model + optimal trajectory algorithms to do planning.
For all $x_t$ achievable in simulation, we want our dataset to have samples $(x_t,u_t,x_{t+1})$ corresponding to all possible controls $u_t$ in the action space applied to that state.

Right now, my "exploration policy" doesn't use any information from the ongoing simulation. In fact, I just uniformly sample over the action space without any knowledge of the current observation.
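Concretely, the current data collection amounts to something like the following. The `env.reset`/`env.step` interface is a stand-in for my simulator, not its real API:

```python
import numpy as np

def collect_tuples(env, num_trajectories, horizon, u_low, u_high):
    """Naive exploration: uniform random controls, ignoring the observation."""
    data = []
    for _ in range(num_trajectories):
        x_t = env.reset()                           # sample a starting state x_0
        for _ in range(horizon):
            u_t = np.random.uniform(u_low, u_high)  # no dependence on x_t at all
            x_next = env.step(u_t)
            data.append((x_t, u_t, x_next))
            x_t = x_next
    return data
```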

Intuitively, we'd instead like to get the robot to interactively explore the state-action space as fully as possible, particularly regions where the reconstruction or prediction loss is high. During training, we want to draw more samples $x_t$ from the state space (I use $x$ as a proxy for "state" even though it actually means "observation" in E2C) that correspond to high reconstruction losses. Similarly, we want to draw more samples $(x_t,u_t,x_{t+1})$ that result in high prediction loss, so we can hopefully obtain more useful gradients.

This is akin to adaptive importance-sampling regions of high noise in path tracing!

We could compute $\nabla_\rho x_t$, and perhaps use that to query which states the robot should sample more of, but in general we cannot compute what constitutes a valid state, short of solving a trajectory optimization problem.

Instead, I propose an "exploration policy" that adaptively samples $u_t$ to encourage exploration of the space. There are some possible approaches:
  • We can do gradient ascent on the reconstruction loss $L_\text{bound}$ w.r.t. $\rho$, i.e. $\rho := \rho + \nabla_\rho L_\text{bound}$, so the exploration policy proposes $u_t$'s that tend to increase the E2C loss. The plain-English explanation is that we want the robot to re-try actions that it failed to predict well from some previous state.
  • Of course, the loss is guaranteed to increase only if we start from the same state $x_t$, but our starting state has since changed to $x_{t+1}$. I don't think there's much we can do about this, since in real-life robot training we probably have to manually initialize starting states and let most of the data come from a single trajectory. But maybe $x_{t+1}$ is sufficiently close to $x_t$ that this doesn't matter...
  • It's easy to see how this can fail to converge if the robot is constantly chasing bad actions, overfitting, then un-learning all its previous understanding of the state space, so the composition of each training minibatch needs to contain some past samples. 
At time $t$, we have our updated $\rho$ from the exploration policy and apply it immediately to the robot to obtain $(x_t,u_t,x_{t+1})$.
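As a sketch, the $\rho$-update is just gradient ascent on the already-built E2C bound, restricted to the policy variables. Both `L_bound` and `policy_vars` below are dummy stand-ins (assumptions), since the real ones live in my E2C graph:

```python
import tensorflow as tf

# Stand-ins for the real graph: L_bound would be the scalar E2C bound, and
# policy_vars the list of pi_rho weights (both hypothetical here).
policy_vars = [tf.Variable(tf.zeros([4, 2]), name='pi_rho_W')]
L_bound = tf.reduce_sum(tf.square(policy_vars[0]))   # dummy scalar loss

# Minimizing -L_bound is the same as rho := rho + lr * grad_rho(L_bound);
# var_list keeps the E2C weights out of this ascent step.
ascend_rho = tf.train.GradientDescentOptimizer(learning_rate=1e-3).minimize(
    -L_bound, var_list=policy_vars)
```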

Adversarial Exploration Policy Gradient

Here's a better idea: at time $t$, we use $x_t$ to compute $u_t$, i.e. $\pi_\rho$ is an adversarial policy that attempts to maximize $L$ given $x_t$.

$$
u_t \sim \pi_\rho(\cdot|x_t) \\
\rho := \rho + \nabla_\rho  L_\text{bound}
$$

Over $M$ simulation time steps, we accumulate $M$ samples of $(x_t,u_t,x_{t+1})$, which we then use to minibatch-update $\pi_\rho$ jointly with the E2C model (we'll need two copies of the network: a single-sample one for acting during simulation, and a minibatch one for passing $x_{t:t+M}$ through).

To prevent overfitting to the (highly) temporally-correlated $M$-step trajectory, we let a fraction of the samples be drawn from an experience-replay database; in this case, the adversarial policy also gets exposed to experience replay. I'm working through the Prioritized Experience Replay paper, which has a lot of interesting ideas on this front that I need to think about more.
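A minimal way to compose such a minibatch (the 50/50 split and names are arbitrary guesses; prioritized sampling would replace the uniform `random.sample` draws):

```python
import random

def make_minibatch(recent_tuples, replay_buffer, batch_size, replay_frac=0.5):
    """Mix the freshest (temporally correlated) samples with draws from an
    experience-replay buffer, so neither E2C nor pi_rho overfits to one trajectory."""
    n_replay = int(batch_size * replay_frac)
    n_fresh = batch_size - n_replay
    batch = random.sample(recent_tuples, min(n_fresh, len(recent_tuples)))
    batch += random.sample(replay_buffer, min(n_replay, len(replay_buffer)))
    return batch
```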

I think this is pretty elegant because we're using model-free control (exploration policy) to act as an importance sampling scheme for learning a model-based RL system (E2C), and the whole thing is coupled via the trendy notion of "dueling architectures".

Exploration without E2C information

A simpler option is that we don't involve E2C at all in the exploration policy, but instead let the exploration policy try to figure out the space on its own.

In the case of the plane control task, the entire state space is computable, so sampling is relatively easy. In general though, the topology of the state space is unknown, so we'd need to devise some kind of universal exploration function to properly characterize the space.

Not sure if there are existing approaches...

Stuff that didn't seem to work right away, but I'm still interested in pursuing:
  • Multistep E2C - I'd like to measure the total prediction loss over a trajectory under a single-step E2C model trained for $2N$ iterations, and then compare that to the same loss after a multistep model (initialized with E2C weights trained for $N$ iters) has been trained for an additional $N$ iterations.
  • We could even combine multi-step E2C with sequential generative models between each time step. Here's where the very-deep capabilities of ResNets would be powerful...
Is there software for organizing exploratory research ideas like these? I have the basic E2C model, and I want to know what happens when I extend it in various ways (add attention, use residual generation, go multistep), and I'm also interested in what happens when I combine them all together. It's also important, from an experimental standpoint, to vary parameters incrementally; otherwise I can't know for sure what a change does to the model. I know Domino Data Labs does something like this, but I think they're more focused on data science / analysis than model-building.


Fixed E2C Implementation Mistake

*facepalm

Just realized that I made an error in my E2C implementation. The transition dynamics in the latent space $Z$ are sampled as follows:
$$
z_t \sim N(\mu_t,\Sigma_t) \\
\hat{z}_{t+1} \sim N(A_t\mu_t+B_tu_t+o_t,\,C_t)\\
C_t=A_t\Sigma_t A_t^T
$$

Previously, I had come up with a transformation of a sample $i \sim N(0,\Sigma_t)$ into the desired sample $\hat{z}_{t+1}$. While that sample is indeed drawn from the right distribution, closer inspection shows that this is the wrong approach from a physical standpoint:

In a locally-linear dynamical system, the next state is determined via $z_{t+1} = A_tz_t+B_tu_t+o_t$, where $A$ is the Jacobian w.r.t. autonomous (latent) state dynamics and $B$ is the Jacobian w.r.t. applied controls $u$.

I don't think there should be anything stochastic about the transition process itself. $\hat{z}_{t+1} \sim N(A_t\mu_t+B_tu_t+o_t,\,C_t)$ is a random variable not because the transition dynamics are probabilistic, but because $z_{t+1}$ is a deterministic transformation of the random variable $z_t$. In other words, all of the randomness in $z_{t+1}$ should come from $z_t$; conditioned on $z_t$, its variance should be 0.

If we re-sample $\hat{z}_{t+1}$ probabilistically, we are adding extra noise into the transition from a second, independent draw of $N(0,1)$. The marginal variance of $z_{t+1}$ remains the same, but we've increased the variance of $z_{t+1}-z_t$, so the variance of our gradients is going to be greater. Another way to look at this is that we are "de-correlating" our sample of $z_t$ and our sample of $z_{t+1}$ when they should in fact be tightly coupled: we'd be sampling $\hat{z}_{t+1}$ as though every sample of $z_t$ were exactly equal to the mean $\mu_t$.

The fact that $z_{t+1}$ should be a re-parameterized version of $z_t$ rather than re-sampled seems obvious in hindsight, but I just realized that this is exactly how the re-parametrization trick reduces variance here. If we have a set of coupled random variables $x_1, x_2, \ldots, x_N$ we need to sample from, instead of drawing $N$ independent noise samples, we draw a single $N(0,1)$ sample and re-use it to compute every random variable we need.
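As a sanity check, here's a numpy sketch of the corrected sampling path (function and variable names are mine, not the actual implementation): the noise is drawn once for $z_t$, and $\hat{z}_{t+1}$ is then a deterministic function of that same sample.

```python
import numpy as np

def sample_transition(mu_t, Sigma_t, A_t, B_t, u_t, o_t):
    """Draw noise ONCE for z_t, then push that same sample through the
    locally-linear dynamics; z_{t+1} is deterministic given z_t."""
    eps = np.random.randn(mu_t.shape[0])
    z_t = mu_t + np.linalg.cholesky(Sigma_t).dot(eps)   # z_t ~ N(mu_t, Sigma_t)
    z_next = A_t.dot(z_t) + B_t.dot(u_t) + o_t          # no second noise draw
    return z_t, z_next
```

The buggy version drew a second, independent sample from $N(A_t\mu_t+B_tu_t+o_t,\,C_t)$, which has the same marginal distribution but de-correlates $z_t$ and $\hat{z}_{t+1}$.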

Fixing this bug makes the latent space images nicer. The below was trained for 2e5 iterations, with max step size = 3.




Still not perfect, but this was without any of the multistep / KL / large step-size tricks I had previously tried.

Tuesday, April 12, 2016

sequential E2C

Learning a sequence of $X$ simultaneously under the E2C framework leads to low loss values, but the space is no longer well-formed.





Learning a sequence of $\bar{X}$ conditioned on just $X_0$ performs even more poorly (further investigation needed):


"boxbot" worm simulation framework

I spent some time re-designing how I am going to interface the Box2D/Liquidfun physics simulations with TensorFlow, and in the process I wrote a somewhat generic framework for 2D robot simulation for Reinforcement Learning. I intend to clean up the codebase and release it as a separate project after this main capstone project is done.

I used Google Protobufs to define my robot and environment models, which express a rigid-body link hierarchy coupled together by flexible muscles.

Thanks to Protobufs, I can define the robot models in Python and quickly export them to C++.
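To give a flavor of the Python side, here's an illustrative sketch. The generated module and all message/field names below are made up (my actual .proto schema isn't shown here); only the define-in-Python, parse-in-C++ workflow is the point.

```python
# Illustrative only: robot_pb2 would be generated by protoc from a robot.proto
# whose message and field names are hypothetical here.
import robot_pb2

worm = robot_pb2.Robot(name='worm')
for i in range(8):                      # chain of rigid segments
    link = worm.links.add()
    link.id = i
    link.half_width = 0.5
    link.half_height = 0.25
for i in range(7):                      # dorsal + ventral muscle between neighbors
    for side in ('dorsal', 'ventral'):
        muscle = worm.muscles.add()
        muscle.link_a = i
        muscle.link_b = i + 1
        muscle.side = side

with open('worm.robot', 'wb') as f:
    f.write(worm.SerializeToString())   # the C++ side loads this via ParseFromString
```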

Here are a couple robots I made:

octopus arm with a fixed base:

  
Worm:

These two models are actuated via a pair of dorsal and ventral muscles between adjacent segments.



Friday, April 8, 2016

obstacle encodings

Added visualization for encodings of "invalid states" in the latent space. Invalid states are never reached in the training set, so their encodings overlap with the distribution of "reachable" states.


Sunday, April 3, 2016

Better latent space unfolding

I figured out a way to achieve a nicer-looking latent space unfolding.

The trick is to increase the maximum distance the agent is allowed to move per step (if the step is too large, the agent simply collides with and stops at an obstacle).

Here is a still image:


Intuitively, the correct topology is formed by "stitching" together continuous topological patches in latent space. Because we uniformly sample trajectories rather than possible positions, reconstruction fidelity is biased towards places where the agent is seen more often. The topology gets messed up if two adjacent points are difficult to transition between (such as through a narrow passageway).

Consequently, increasing the step size "forces" the reconstructions to be meaningful not just between adjacent points, but also points up to some maximum distance away. In other words, instead of just stitching together adjacent points (corresponding to 1 unit away), points are also implicitly stitched to other points within some radius from the dataset samples.

Given this result, I hypothesize that a multi-step reconstruction loss (i.e. minimizing the reconstruction loss over trajectory forecasts, rather than single steps) will work better. I also think this will improve performance during planning.

If the agent has a $T=10$ horizon for use in iLQR, then it seems like a good idea for its forecasting abilities to be accurate out to around $T=10$ rather than $T=1$. Otherwise, the costs computed far into the horizon may correspond to really blurry images.

But... the actual reconstructions are substantially worse than with the max_dist=1 step size:



There are a number of parallel tasks for me at this point:

  • Multi-step (recurrent) E2C
  • Actual SOC algorithm (this can proceed now that I am reasonably confident about the fidelity of the latent space reconstruction)
  • Implement simulation framework for the worm
    • This simulation code will generate images that are stored in a database and used to train E2C offline. 
    • Online training / adaptive exploration would be awesome, but I haven't thought too much about it yet.

LiquidFun Simulations

Hacked together a very simple muscle system in LiquidFun.

Basically each "muscle segment" has two axles (with limited rotation) and axles between segments are joined via prismatic joints (pistons).

LiquidFun supports particle-based soft bodies, and I got those to work, but unfortunately I don't think it supports attaching joints (actuation) to soft bodies. That's ok - rigid bodies and the above segments should be sufficient for reproducing even the results of the biped locomotion paper.
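For reference, here's roughly that joint layout sketched with the pybox2d bindings. The real code is C++/LiquidFun, and the body shapes, dimensions, and joint limits here are illustrative guesses, not the actual robot definition.

```python
from math import pi
from Box2D import b2World

# One "muscle" between two segments: each segment gets an axle (a small body on
# a limited revolute joint), and the two axles are connected by a motorized
# prismatic joint that acts as the piston/muscle.
world = b2World(gravity=(0, -10))

def make_segment(x):
    body = world.CreateDynamicBody(position=(x, 2.0))
    body.CreatePolygonFixture(box=(0.4, 0.2), density=1.0, friction=0.3)
    return body

def add_axle(segment):
    axle = world.CreateDynamicBody(position=segment.position)
    axle.CreateCircleFixture(radius=0.05, density=1.0)
    world.CreateRevoluteJoint(bodyA=segment, bodyB=axle, anchor=axle.position,
                              enableLimit=True,
                              lowerAngle=-0.2 * pi, upperAngle=0.2 * pi)
    return axle

seg_a, seg_b = make_segment(0.0), make_segment(1.0)
axle_a, axle_b = add_axle(seg_a), add_axle(seg_b)

# The "muscle": driving the prismatic joint's motor contracts/extends the piston.
muscle = world.CreatePrismaticJoint(bodyA=axle_a, bodyB=axle_b,
                                    anchor=axle_a.position, axis=(1, 0),
                                    enableLimit=True,
                                    lowerTranslation=0.2, upperTranslation=1.2,
                                    enableMotor=True, maxMotorForce=50.0,
                                    motorSpeed=0.0)
```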



Saturday, April 2, 2016

Guided Latent Space Unfolding

Tried a slightly different environment map that resulted in a better-looking space.



The latent space seems to get pretty "knotted up" in areas where the robot doesn't explore as much. Currently the dataset's distribution is uniform over the trajectories explored by the robot (but not uniform in possible states in the world).

Does training over more samples of the robot in the blue region result in a better unfolding?

Can we do some kind of interactive importance-sampling of $x_t$ from the dataset? If the training loop is coupled to the simulation loop, can we get the current latent manifold to inform what areas to search over?

Anyway, I had a really random idea that I tried: extend E2C to produce not just a 1-step prediction, but multi-step predictions. The intuition is that maybe the error for 1 step is minimized, but as we march forward, the errors accumulate quickly at $t=2,3,\ldots$, which we can then squash via gradient descent. So far, simply extending E2C to also produce a prediction for $x_2$ given $u_1,x_1$ is not very helpful. The plateaued error is about twice that of the single-step one.

But what if I predict a longer sequence, using each reconstruction as the input to the next step? I'd expect results to get very poor after a few steps due to accumulating blurriness. Maybe we can squash that.
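Something like the loop below, where each predicted reconstruction is fed back in as the next input. All of the function names are placeholders standing in for my encoder, locally-linear transition, decoder, and per-frame reconstruction loss:

```python
def multistep_recon_loss(x0, steps, encode, transition, decode, recon_loss):
    """steps: list of (u_t, x_{t+1}) pairs from a recorded trajectory.
    Feeds the model's own reconstruction back in, so blurriness compounds and
    the accumulated loss is what we'd then minimize by gradient descent."""
    total = 0.0
    x_hat = x0
    for u_t, x_target in steps:
        z_t = transition(encode(x_hat), u_t)   # encode the reconstruction, not ground truth
        x_hat = decode(z_t)
        total += recon_loss(x_hat, x_target)
    return total
```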