Monday, May 16, 2016

Adaptive Exploration on Plane Task

I got the Plane E2C task to work: the adaptive policy is more adept at quickly finding areas of high E2C loss.

Left column is database samples (fixed size of 3000 points), right column is marginal E2C loss (mean over all possible U for a given position).

Random policy:
Adaptive Policy:

I repeated this experiment on a different environment map, one with a narrow corridor.

Random policy:


Adaptive policy:


Notice how in both cases, the adaptive scheme's dataset covers the full training space sooner than the random policy.


Friday, May 13, 2016

Got the 1D simplified model to work

Adaptive exploration is working now, on the 1D task. I simplified the codebase (I think I introduced several bugs by trying to over-engineer the experimental setup and make it "generalize" for a variety of different simulation scenarios).

Environment
$x$ is a single scalar representing the robot's position on the interval $(0,1)$. It moves left and right according to applied controls $u \in (-1,1)$, and is damped by a potential function shown in the first column of the figure below: there is no damping at $x=0.64$ and the highest damping (least movement) occurs at $x=.2$.

Here, I simulated the robot for $T=1500$ steps. The second column depicts sampled states $x$ and the controls $u$ applied at those states. The third column depicts the same trajectory as $x$ over simulation steps.
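
For reference, here is a minimal sketch of the kind of dynamics being simulated. The damping profile below is made up purely for illustration; the real one is whatever the plotted potential function encodes.

import numpy as np

def damping(x):
    # hypothetical damping profile: ~1 (least movement) near x=0.2, ~0 (no damping) near x=0.64
    return np.exp(-(x - 0.2) ** 2 / 0.02)

def step(x, u):
    # the applied control is scaled down by the local damping; position is clipped to (0,1)
    return float(np.clip(x + 0.05 * u * (1.0 - damping(x)), 0.0, 1.0))

x, xs = 0.5, []
for t in range(1500):
    u = np.random.uniform(-1, 1)
    x = step(x, u)
    xs.append(x)
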
Experimental setup

The dataset $D$ consists of 1500 values. In both the random and adaptive schemes, I let the robot wander around for 1500 steps to pre-populate the dataset ("burn-in"), then let it wander around for 30 more "episodes". In each episode, 20% of the existing dataset is replaced by new data sampled by the policy.

This means that under both the random and adaptive policies, the robots are allowed to wander the same number of total steps (1500 + 30*150).
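
Roughly, the collection schedule looks like the sketch below. The helpers collect, train_e2c, and update_policy are placeholder names for the real routines; under the random scheme the policy update is a no-op, and the step counts are the ones quoted above.

import numpy as np

BURN_IN, EPISODES, STEPS_PER_EPISODE = 1500, 30, 150

dataset = collect(random_policy, steps=BURN_IN)        # pre-populate D ("burn-in")
policy = random_policy
for episode in range(EPISODES):
    fresh = collect(policy, steps=STEPS_PER_EPISODE)   # wander under the current policy
    # swap a random subset of the existing dataset for the fresh samples
    idx = np.random.choice(len(dataset), size=len(fresh), replace=False)
    for i, sample in zip(idx, fresh):
        dataset[i] = sample
    train_e2c(dataset)                                  # minimize E2C loss on D
    policy = update_policy(dataset)                     # adaptive scheme only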

Uniform Random Policy:
Explanation of tableau:

rows: episodes 1-5.
columns:
Dataset - distribution of points used to train E2C and if applicable, the exploration policy.
Learned P(U|X) - the policy distribution after training on this cycle's dataset
Reconstruction Loss heatmap (X on x-axis, U on y-axis)
Prediction Loss heatmap (X on x-axis, U on y-axis)


I had previously drawn the L2 loss used to learn E2C, and the reconstructions looked pretty good. However, since $x \in (0,1)$ and the predicted motions are actually quite small, the L2 loss is very dim when visualized.





Adaptive Policy:

The adaptive model converges slightly faster, and is able to move the green points away from existing blue/green points. At a given episode, if the E2C loss at the blue/green regions is already 0 (i.e. via minimization in the outer loop), then our updated policy will propose samples $u_0$ that strictly increase the loss should it re-encounter those same $x$ values.

Green points correspond to newly-sampled points (via policy) and swapped into the dataset for that given cycle.


Something else that worked was to reset the exploration policy on each episode. The intuition here is that we want the policy to avoid samples it just proposed in the previous episode (group of simulation steps); since we bootstrap off our previous policy, we'd probably pick similar points again in spite of optimization. So, we re-optimize the policy from an un-trained state on each episode, and just let the weights for the E2C model carry over.
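
In TensorFlow this amounts to re-running the initializer op for the policy's variables only, while leaving the E2C variables alone. A minimal TF 1.x-style sketch (variable scopes and shapes are made up):

import tensorflow as tf

with tf.variable_scope("e2c"):
    w_e2c = tf.get_variable("w", shape=[10, 10])
with tf.variable_scope("policy"):
    w_pi = tf.get_variable("w", shape=[10, 2])

policy_vars = [v for v in tf.trainable_variables() if v.name.startswith("policy")]
reset_policy_op = tf.variables_initializer(policy_vars)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for episode in range(30):
        sess.run(reset_policy_op)   # re-initialize the exploration policy from scratch
        # ... train E2C + the fresh policy on this episode's dataset ...
        # E2C weights carry over to the next episode.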

However, it does seem a bit wasteful to have to re-learn everything from scratch on each episode. Need to think more about this.

I also realized that it's not possible to maximize $L_{e2c}$ directly with respect to $u$ - we cannot compute the E2C loss without having access to the transition function. This also applies to sampling-based approaches where we pick a bunch of random $u$ and see which one incurs the highest E2C loss - this is impossible because, again, we can't compute the E2C loss for an unobserved sample.




Thursday, May 12, 2016

Adaptive AE toy task

E2C is too complicated to deal with, so I simplified the model and am just using a simple autoencoder to do prediction and reconstruction. No dynamics or anything fancy.

I modified the potential function, and deliberately used fewer samples in the dataset.

Random policy



Adaptive policy



Note: the adaptive policy uses random noise, but it seems to perform poorly when sampling normal $e \sim N(0,1)$ noise. Further investigation needed.


Couple notes to self:

  • When there are issues with "need to feed placeholders", it's likely because you forgot to set var_list in the optimizer (see the sketch after this list).
  • These simple models train faster on CPU than GPU.
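
For the first point, what I mean is restricting each optimizer to its own variables, roughly like this (TF 1.x-style sketch with a toy variable and loss):

import tensorflow as tf

with tf.variable_scope("policy"):
    w = tf.get_variable("w", shape=[1])
policy_loss = tf.reduce_sum(tf.square(w - 1.0))

policy_vars = [v for v in tf.trainable_variables() if v.name.startswith("policy")]
# Without var_list, minimize() differentiates w.r.t. *all* trainable variables,
# which can drag in parts of the graph whose placeholders aren't fed by this train op.
train_policy_op = tf.train.AdamOptimizer(1e-3).minimize(policy_loss, var_list=policy_vars)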

Wednesday, May 11, 2016

Adversarial Exploration troubles...

I've been running into issues getting "adversarial exploration" to work. I'm not certain if it's a programming bug (something wrong with my tensorflow implementation or the way Python modules are imported into each other) or a theoretical one (getting trapped in a sampling region and the local gradients preventing it from escaping). In any case, it's not working as well as I thought it would, even for the 1D case.

The idea of this "adversarial exploration policy" is to alternate between exploring the state-action space and training E2C, using our existing samples and the current E2C model to decide where to go. The basic idea is to importance-sample observation-action mini-batches according to the distribution of the E2C model's objective function. This prioritizes experiences to favor training samples with high loss, leading to faster convergence of model learning with respect to agent exploration time.

Let $X = (x_i,u_i,x_{i+1})$ be a random variable representing a single tuple in the state-action-next_state space.

For each new episode $c$ where the current state observation is $x_{c-1}$, we solve the minimax problem

$$
\min_{\phi,\psi,\theta} \max_{\pi}\mathbb{E}_{\mathcal{D}_c}{[L_{\text{E2C}}|\mathcal{D}_{c-1}]}
$$
where
$$
\mathcal{D}_c = \mathcal{D}_{c-1} \cup \{(x_{c-1}, u_{c-1}, f(x_c))\}\\
u_{c-1} \sim P_\pi(\cdot | x_{c-1})
$$
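
In practice I approximate this minimax by alternating gradient steps each episode, roughly like the sketch below (all of the names are placeholders for the real routines, and the step counts are arbitrary):

for c in range(num_episodes):
    # sampling step: the policy proposes a control at the current observation
    u = sample_control(policy, x_prev)            # u_{c-1} ~ P_pi(. | x_{c-1})
    x_next = env_step(x_prev, u)                  # the only place the true dynamics appear
    dataset.append((x_prev, u, x_next))           # D_c = D_{c-1} plus the new tuple

    for _ in range(model_steps):
        e2c_descent_step(dataset)                 # min over (phi, psi, theta): lower E[L_E2C]
    for _ in range(policy_steps):
        policy_ascent_step(dataset)               # max over pi: raise E[L_E2C] on observed tuples

    x_prev = x_next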

What dataset should $\mathcal{D}_c$ converge to? The densities shouldn't matter, as long as E2C reconstruction is robust for whatever planning task we want to do (of course, performance degrades if we spend time exploring rare regions where the robot is unlikely to end up).

Right now I am optimizing $\pi$ over samples drawn from $\mathcal{D}_c$, as I should be for optimizing the inner loop of minimax. I could instead take a fast algorithm that just evaluates the E2C model for latin hypercube / orthogonal sampling over the action space, and just choose the action that results in the highest "predicted" E2C loss. This would be more akin to a policy-free search that simply chooses regions of high loss every step, rather than trying to learn a "value function" across the entire observation space implicitly.

Fixed a weird bug where separating out my Variational autoencoder into a separate module resulted in a computational graph that still trained but *very poorly* (loss was not decreasing). I still do not know why this happened: perhaps a python variable passed as a reference got modified somewhere?

Here are some tableaus I constructed. For some reason the reconstruction line is flat - all the X values are being mapped to the same values.


random sampling.

A stochastic policy seemed to be pretty important; the adaptive results are somewhat promising, though significantly worse than I'd hoped.




Here's the reconstruction on the plane task



In order to demonstrate that adaptive E2C works, I would need to try an even simpler VAE or AE reconstruction of 1D input.

Thursday, April 21, 2016

Fixed gRPC compilation

After struggling to get gRPC working between C++/Python for about a week, I figured it out:

- need to override the Protobuf headers in /usr/local/include, otherwise I get the "this protobuf header was compiled with a more recent version of protoc" error.
- in Qt, I can add the helloworld cpp example directory (from gRPC) into the include path, but in order for this to work I also need to add the relevant headers I want to use into that directory.

I have C++ talking to Python now, time to serialize images and send them over.

Wednesday, April 20, 2016

some experimental results

Re-ran some experiments from yesterday using longer max step sizes and multistep E2C. I also changed the sampling scheme to pick trajectories and times without replacement (but it doesn't have much of an effect for this robot). Here are the latent space reconstructions:

Single-step E2C, max_step=1


Single-step E2C, max_step=10

Multi-step E2C, max_step=1
I tried re-running the multistep training initialized to the single-step max_step=1 weights, but the gradients immediately blew up.




Tuesday, April 19, 2016

Idea Dump

Here's a list of stuff I want to try:
Stuff I'm pretty sure will work:
  • Use spatial transformers to localize the agent before passing it to the encoder, so that we only encode state variables relevant to the robot pose (and not the entire system).
  • In the paper, the authors pass in the previous frame in addition to current frame. But what if instead we passed in the difference between the current and last frame (i.e. $x_t-x_{t-1}$)? Seems like the sparsity will make it easier to infer velocity.
  • Use ResNets for computing the transition network. For most reasonable systems, I hypothesize that it's easier to sample a perturbation $\Delta z$ and then compute $z_{t+1} = z_t + \Delta z$ rather than sampling $z_{t+1}$ unconditionally. This is just like using an additive canvas in DRAW (see the sketch below).
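
A minimal TF 1.x-style sketch of what I mean by a residual transition (sizes are arbitrary):

import tensorflow as tf

z_dim, u_dim, hidden = 2, 2, 64
z_t = tf.placeholder(tf.float32, [None, z_dim])
u_t = tf.placeholder(tf.float32, [None, u_dim])

# predict a perturbation and add it to z_t, instead of regressing z_{t+1} from scratch
h = tf.layers.dense(tf.concat([z_t, u_t], axis=1), hidden, activation=tf.nn.relu)
delta_z = tf.layers.dense(h, z_dim)
z_next = z_t + delta_z
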
Stuff I hope will work:

E2C uses the following pipeline:
  1. Sample trajectory starting states $x_0$.
  2. The exploration policy $\pi_\rho$ samples random control trajectories, and we feed these into the simulation with $x_0$ to generate action-state-transition tuples $(x_t,u_t,x_{t+1})$. 
  3. Use the E2C framework to learn the latent space & transition model.
  4. Use the model + optimal trajectory algorithms to do planning.
For all $x_t$ achievable in simulation, we want our dataset to have samples $(x_t,u_t,x_{t+1})$ corresponding to all possible controls $u_t$ in the action space applied to that state.

Right now, my "exploration policy" doesn't use any information from the ongoing simulation. In fact, I just uniformly sample over the action space without any knowledge of the current observation.

Intuitively, we'd instead like to get the robot to interactively explore the state-action space as fully as possible, particularly regions where the reconstruction or prediction loss is high. During training, we want to draw more samples $x_t$ from the state space (I use $x$ as a proxy for "state" even though it actually means "observation" in E2C) that correspond to high reconstruction losses. Similarly, we want to draw more samples $(x_t,u_t,x_{t+1})$ that result in high prediction loss, so we can hopefully obtain more useful gradients.

This is akin to adaptive importance-sampling regions of high noise in path tracing!

We could compute $\nabla_\rho x_t$, and perhaps use that to query what states the robot should sample more of, but in general we cannot compute what constitutes a valid state, short of solving a trajectory optimization algorithm.

Instead, I propose an "exploration policy" that adaptively samples $u_t$ to encourage exploration of the space. There are some possible approaches:
  • We can do gradient ascent on the reconstruction loss $L_\text{bound}$ w.r.t. $\rho$, i.e. $\rho := \rho + \nabla_\rho L_\text{bound}$, so the exploration policy proposes $u_t$'s that tend to increase E2C loss. The "plain English" explanation of this is that we want the robot to re-try actions that it failed to predict well for some previous state.
  • Of course, the loss is guaranteed to be increased only if we start from the same state $x_t$, but our new starting state has changed to $x_{t+1}$. I don't think there's much we can do about this, since in real life robot training we probably have to manually initialize starting states and let most of the data come from a single trajectory. But maybe $x_{t+1}$ is sufficiently close to $x_t$ that this doesn't matter...
  • It's easy to see how this can fail to converge if the robot is constantly chasing bad actions, overfitting, then un-learning all its previous understanding of the state space, so the composition of each training minibatch needs to contain some past samples. 
At time $t$, we have our updated $\rho$ from the exploration policy and apply it immediately to the robot to obtain $(x_t,u_t,x_{t+1})$.

Adversarial Exploration Policy Gradient

Here's a better idea: at time $t$, we use $x_t$ to compute $u_t$, i.e. $\pi_\rho$ is an adversarial policy that attempts to maximize $L$ given $x_t$.

$$
u_t \sim \pi_\rho(\cdot|x_t) \\
\rho := \rho + \nabla_\rho  L_\text{bound}
$$

Over $M$ simulation time steps, we accumulate $M$ samples of $(x_t,u_t,x_{t+1})$, which we then use to minibatch-update $\pi_\rho$ jointly with the E2C model (we'll need two copies of the net: a single-sample copy for acting and a minibatch copy for passing $x_{t:t+M}$ through).

To prevent overfitting to the (highly) temporally-correlated M-trajectory, we let a fraction of the samples be drawn from an experience replay database. In this case, the adversarial policy also gets exposed to experience replay. I'm working through the paper Prioritized Experience Replay, which has a lot of interesting ideas that I need to think about more on this front.
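
Concretely, I plan to build each minibatch as a mix of the fresh $M$-step samples and uniformly drawn replay samples, something like the sketch below (the 50/50 split is arbitrary):

import numpy as np

def make_minibatch(fresh, replay, batch_size=64, fresh_frac=0.5):
    # mix recent on-policy tuples with tuples drawn from the experience replay database
    n_fresh = int(batch_size * fresh_frac)
    fresh_idx = np.random.choice(len(fresh), size=n_fresh)
    replay_idx = np.random.choice(len(replay), size=batch_size - n_fresh)
    return [fresh[i] for i in fresh_idx] + [replay[i] for i in replay_idx]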

I think this is pretty elegant because we're using model-free control (exploration policy) to act as an importance sampling scheme for learning a model-based RL system (E2C), and the whole thing is coupled via the trendy notion of "dueling architectures".

Exploration without E2C information

A simpler option is that we don't involve E2C at all in the exploration policy, but instead let the exploration policy try to figure out the space on its own.

In the case of the plane control task, the entire state space is computable, so sampling is relatively easy. In general though, the topology of the state space is unknown, so we'd need to devise some kind of universal exploration function to properly characterize the space.

Not sure if there are existing approaches...

Stuff that didn't seem to work right away, but I'm still interested in pursuing:
  • Multistep E2C - I'd like to measure the total prediction loss over a trajectory under a single-step E2C model trained for $2N$ iterations, and then compare  that to the same loss after a multistep model (initialized with E2C weights trained for $N$ iters) has been trained for an additional $N$ iterations.
  • We can even combine multi-step E2C with sequential generative models between each time step. Here's where very deep capabilities of ResNets would be powerful...
Is there software for organizing exploratory research ideas such as these? I have the basic E2C model, and I want to know what happens when I extend it in various ways (add attention, use residual generation, multistep), and I'm also interested in what happens when I combine these all together. It's also important from an experimental standpoint to vary experimental parameters incrementally, otherwise I can't know for sure what a change does to the model. I know Domino Data Labs does something like this, but I think they're more focused on data science / analysis than model-building.


Fixed E2C Implementation Mistake

*facepalm

Just realized that I made an error in my E2C implementation. The transition dynamics in the latent space $Z$ are sampled as follows:
$$
z_t \sim N(\mu_t,\Sigma_t) \\
\hat{z}_{t+1} \sim N(A_t\mu_t+B_tu_t+o_t|C_t)\\
C_t=A_t\Sigma_t{A}_t^T
$$

Previously, I had come up with a transformation of a sample $i \sim N(0,\Sigma_t)$ into the desired sample $\hat{z}_{t+1}$. While the sample is indeed drawn from the right distribution, a closer inspection shows that this is totally the wrong approach from a physical standpoint:

In a locally-linear dynamical system, the next state is determined via $z_{t+1} = A_tz_t+B_tu_t+o_t$, where $A$ is the Jacobian w.r.t. autonomous (latent) state dynamics and $B$ is the Jacobian w.r.t. applied controls $u$.

I don't think there should be anything non-deterministic about this transition process. $\hat{z}_{t+1} \sim N(A_t\mu_t+B_tu_t+o_t|C_t)$ is a random variable not because the transition dynamics are probabilistic, but because $z_{t+1}$ is a deterministic transformation of the random variable $z_t$. In other words, we desire the variance of $||z_{t+1}-z_t||$ to be 0.

If we re-sample $\hat{z}_{t+1}$ probabilistically, we are adding "extra noise into the transition" from the second sampling of $N(0,1)$. The variance of $z_{t+1}$ remains the same, but we've increased the variance of $||z_{t+1}-z_t||$, so the variance of our gradients is going to be greater. Another way to look at this is that we are "de-correlating" our sample of $z_t$ from our sample of $z_{t+1}$ when they should in fact be tightly coupled. We are sampling $\hat{z}_{t+1}$ as if each sample of $z_t$ were exactly equal to the mean $\mu_t$.

The fact that $z_{t+1}$ should be a re-parameterized version of $z_t$ rather than re-sampled seems obvious in hindsight, but I just realized that this is the entire method by which the re-parametrization trick reduces variance. If we have a set of random variables $x_1, x_2, ... x_N$ we need to sample from, instead of sampling $N$ different noise functions, we sample a single $N(0,1)$ and re-use that noise sample to compute every random variable we need.
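
In 1D the difference between the broken and fixed versions looks like this (numbers are toy values, and the extra system noise term is omitted since this entry uses $C_t=A_t\Sigma_t A_t^T$):

import numpy as np

rng = np.random.RandomState(0)
mu, sigma = 0.5, 0.1                 # parameters of q(z_t | x_t)
A, B, o, u = 1.0, 0.2, 0.0, 0.3      # toy locally-linear dynamics

eps = rng.randn()
z_t = mu + sigma * eps               # reparameterized sample of z_t

# wrong (what I had): draw a *fresh* noise sample for z_{t+1}; (z_t, z_hat) get decorrelated
z_hat_resampled = A * mu + B * u + o + (A * sigma) * rng.randn()

# fixed: z_{t+1} is a deterministic transform of the *same* z_t sample (noise is reused)
z_hat = A * z_t + B * u + o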

Fixing this bug makes the latent space images nicer. The below was trained for 2e5 iterations, with max step size = 3.




Still not perfect, but this was without any of the multistep / KL / large step-size tricks I had previously tried.

Tuesday, April 12, 2016

sequential E2C

Learning a sequence of $X$ simultaneously under the E2C framework leads to low loss values, but the space is no longer well-formed.





Learning a sequence of $\bar{X}$ conditioned on just $X_0$ performs even more poorly (further investigation needed):


"boxbot" worm simulation framework

I spent some time re-designing how I am going to interface the Box2D/Liquidfun physics simulations with TensorFlow, and in the process I wrote a somewhat generic framework for 2D robot simulation for Reinforcement Learning. I intend to clean up the codebase and release it as a separate project after this main capstone project is done.

I used Google Protobufs to define my robot and environment models, which can be used to express a rigid-body link hierarchy coupled together by flexible muscles.

Thanks to Protobufs, I can define the robot models in Python and quickly export them to C++.

Here are a couple robots I made:

octopus arm with a fixed base:

  
Worm:

These two models are actuated via a pair of dorsal and ventral muscles between adjacent segments.



Friday, April 8, 2016

obstacle encodings

Added visualization for encodings of "invalid states" in the latent space. Invalid states are never reached in the training set, so their encodings overlap with the distribution of "reachable" states.


Sunday, April 3, 2016

Better latent space unfolding

I figured out a way to achieve a nicer-looking latent space unfolding.

The trick is to increase the maximum distance that the agent is allowed to move per step (if it is too large, it will collide and stop at an obstacle).

Here is a still image:


Intuitively, the correct topology is formed by the "stitching" of continuous topological patches in latent space. Because we are uniformly sampling trajectories and not possible positions, the reconstruction fidelity is biased towards places where the agent is seen more often.  The topology gets messed up if two adjacent points are difficult to transition between (such as a narrow passageway).

Consequently, increasing the step size "forces" the reconstructions to be meaningful not just between adjacent points, but also points up to some maximum distance away. In other words, instead of just stitching together adjacent points (corresponding to 1 unit away), points are also implicitly stitched to other points within some radius from the dataset samples.

Given this result, I hypothesize  that a multi-step reconstruction loss (i.e. minimizing the reconstruction loss over trajectory forecasts, rather than single steps) will be better. I also think that this will improve performance during planning.

If the agent has a $T=10$ horizon for use in iLQR, then it seems like a good idea for its forecasting abilities to be accurate up to around $T=10$ rather than $T=1$. Otherwise, the costs computed far away may correspond to really blurry images.

But... actual reconstructions are substantially worse than the max_dist=1 step size:



There are number of parallel tasks for me at this point:

  • Multi-step (recurrent) E2C
  • Actual SOC algorithm (this can proceed now that I am reasonably confident about the fidelity of the latent space reconstruction)
  • Implement simulation framework for the worm
    • This simulation code will generate images that are stored in a database and used to train E2C offline. 
    • Online training / adaptive exploration would be awesome, but I haven't thought too much about it yet.

LiquidFun Simulations

Hacked together a very simple muscle system in LiquidFun.

Basically each "muscle segment" has two axles (with limited rotation) and axles between segments are joined via prismatic joints (pistons).

LiquidFun supports particle-based soft bodies, and I got those to work, but unfortunately I don't think it supports attaching joints (actuation) to soft bodies. That's ok - rigid bodies and the above segments should be sufficient for reproducing even the results of the biped locomotion paper.



Saturday, April 2, 2016

Guided Latent Space Unfolding

Tried a slightly different environment map that resulted in a better-looking space.



The latent space seems to get pretty "knotted up" in areas where the robot doesn't explore as much. Currently the dataset's distribution is uniform over the trajectories explored by the robot (but not uniform in possible states in the world).

Does training over more samples of the robot in the blue region result in a better unfolding?

Can we do some kind of interactive importance-sampling of $x_t$ from the dataset? If the training loop is coupled to the simulation loop, can we get the current latent manifold to inform what areas to search over?

Anyway, I had this really random idea that I tried, which was to extend E2C to produce not just a 1-step prediction, but a multi-step one. The intuition is that maybe the error for 1 step is minimized, but as we march forward, the errors accumulate quickly at $t=2,3,...$, and we can squash those via gradient descent. So far, simply extending E2C to also produce a prediction for $x_2$ given $u_1,x_1$ is not very helpful. The plateaued error is about twice that of the single-step model.

But what if I predict a longer sequence using the reconstruction as the input? I'd expect results to get very poor after a few steps due to the blurriness accumulating. Maybe we can squash that.

Thursday, March 31, 2016

On Discriminative Regularization

The paper "Discriminative Regularization for Generative Models" makes an important insight - we can use the output of a classifier as a regularization term within the objective function for training generative models.

Other places where this "multi-task regularization" comes up:

  • Learning Physical Intuition of Block Towers by Example - in addition to "predicting where the blocks will end up", the network is also asked to predict a binary "will the blocks fall?" In a hand-wavy sense, this acts as a way to "force" the internal representation to be useful for classifying an important "high-level" feature: namely, whether the blocks fell or not. As humans, we tend to be more interested in whether the blocks fell or not than the exact L2 distance between where the blocks fell and where we thought they would fall.

Indeed, the images produced by E2C (and DRAW, to some extent!) appear blurry.



How might we apply D.R. to E2C?


Human perception is goal-driven. The objective term corresponding to the fidelity of latent space encoding/decoding should not only be penalized by deviations in prediction (i.e. binary CE between x, x_recons) but also be penalized in some "goal-space":

Some ideas;
  • Binary variable of whether agent is in free fall or not (to borrow directly from Lerer et al. 2016). Only applies to situations with gravity I think.
  • Scalar "value function" to predict the cost of a random T=10 horizon trajectory. We can compute this only using a random sample of controls, so this is actually fairly easy to sample. Given $x_t$, $u_{t},u_{t+1},...,u_{t+T} \in U$ we hallucinate $\hat{X}_{t:t+T}=\{\hat{x}_{t+1},\hat{x}_{t+2},...,\hat{x}_{t+T}\}$. Then we feed $\hat{X}_{t:t+T}$ (T images) into some big convnet (not unlike the Atari-playing game) and use that to compute the cost. Meanwhile, we can run the simulator on the actual controls $u_{t},u_{t+1},...,u_{t+T} \in U$ and use that to obtain $X_{t:t+T}$ and the true costs $J$ of the local trajectory. Of course, we don't want to bias the representation towards a very specific cost function (i.e. a very specific goal state) so we could sample a variety of goal images.
  • Approximate the value $c(z_t,u_t)$ for a randomly sampled $z_{\text{goal}}$.
  • Binary variable - "is the agent touching an obstacle?".
I'm currently working my way through "One-Shot Generalization in Deep Generative Models", which has some really interesting ideas whose theoretical foundations I don't fully understand, but which I feel will give me a much better grasp of how attention mechanisms might help improve E2C performance.

E2C is working... sort of

Finally got E2C implementation working.



Here's the visualization of the latent space of the plane task (moving randomly around in a simple 2D maze) unfolding over time for the first few iterations.

For the first ~5000 iterations, the latent space is completely mixed, but the colors separate somewhat during training.

Unfortunately, the results do not look as pretty as those of the paper (latent space becomes nearly identical to true space), but perhaps it is because my data points are less densely populated at the bottom of the image? I.e. the state space is not uniformly sampled (the trajectories sample more from the top region than the bottom one).

The next steps:

  • Get iLQG control working with this agent. The task space hasn't properly unfolded, which is kind of concerning. 
  • Resume work on Box2D + LiquidFun implementation of muscle robot.
A rather daunting difficulty right now is the static nature of the observations. The environment isn't moving, and I'm uncertain how "accurate" the latent manifold will be in the presence of visual noise.

For example, can I get the worm to cross a moving platform across a chasm?

Wednesday, March 30, 2016

Reparameterization trick for Q_psi

Killed 2 bugs with one stone - turns out that I don't need to compute $C_t=A\Sigma A^T+H$ in order to generate a sample from it!

To sample $z_{t+1} \sim N(A\mu+Bu+o, A\Sigma A^T+H)$, let $\omega\sim N(0,H)$ and $x \sim N(0,\Sigma)$. $\Sigma$ is diagonal so we can sample $x$ using the ordinary reparameterization trick outlined in the VAE paper.

Let $y=Ax+\omega$.

Then we have
$$\mu_y=E[Ax+\omega]=E[Ax]+E[\omega]=A*0+0=0$$
$$\Sigma_y=\text{Var}[Ax+\omega]=\text{Var}[Ax]+\text{Var}[\omega]=A\Sigma A^T + H$$

Which yields us the exact covariance we want!

Then we can just compute $z_{t+1}=A\mu+Bu+o+ y$. Note that we could have centered $x$ at $\mu$, in which case $\mu_y=A\mu$, to which we just add $Bu+o$; but it turns out that we need to compute $\mu_{z_{t+1}}$ anyway for measuring the KL divergence later.
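
Putting it together, the sampling path looks like this (toy numpy sketch; diagonal covariances are stored as standard deviations, and the diagonal "system noise" H is an assumed free parameter):

import numpy as np

rng = np.random.RandomState(0)
n = 2                                    # toy latent dimension
A = np.eye(n) + 0.1 * rng.randn(n, n)    # toy locally-linear dynamics
B, o = 0.1 * rng.randn(n, 1), np.zeros(n)
u = np.array([0.5])
mu, sigma = rng.randn(n), np.abs(rng.randn(n))   # q(z_t|x_t): mean and diagonal std
h_std = 0.01 * np.ones(n)                # diagonal "system noise" H (assumed)

x = sigma * rng.randn(n)                 # x ~ N(0, Sigma) via the ordinary reparam trick
omega = h_std * rng.randn(n)             # omega ~ N(0, H)
y = A @ x + omega                        # y ~ N(0, A Sigma A^T + H); no need to form C_t

mu_next = A @ mu + B @ u + o             # we need this mean later for the KL term anyway
z_next = mu_next + y                     # z_{t+1} ~ N(A mu + B u + o, A Sigma A^T + H)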

Latent Loss Term

When I train the model to just minimize L_x, and then add the L_z term later and re-train, the network learns fine. However, when I try to minimize both L_x+L_z right off the bat, the gradient blows up.

Fixed it - I had L_z implemented incorrectly (wrong sign somewhere).

Now it trains with both L_z + L_x.

Figured it out

Fixed the VAE bug. 

Turns out that my VAE model was correctly implemented, but during testing I was forgetting to invert the images I was passing in (which is what I was doing during training). Of course, this messes everything up. The reconstruction losses I was obtaining were actually converging very close to 0, which was an accurate reflection of what was going on during training.

The issue was my sloppy programming skills. I could have avoided this mistake (and saved a couple days of grief) by simplifying the data interface to avoid having to manually process it after calling sample().

Sent an email to Manuel Watter asking about the KL divergence business and some other questions. Here's what he says:
1.  You're right, the KL can explode. Ideally, the network should stay away from the problematic values for v and r, but that's of course not guaranteed. John Assael's approximation is only valid if, as you said, there is only insignificant covariance between the state dimensions, i.e. small perturbations of v and r. This becomes problematic for example in the case of velocities, because they obviously influence the position. The network might choose a state representation that minimizes this influence then, but it's hard to say.
2. Both of your tests should work here and I think the MNIST reconstructions look not too bad. In 10 dimensions, he really can't do more than represent an average version of the numbers and since there's no label information, some of those get blurred together. For the maze on the other hand, he should be able to give almost perfect reconstructions, or at least get the positions of the agent right. From your settings, the minibatch size seems a little odd. 1000 is pretty much, try 128 or 64. I'm not sure about the beta1 value of Adam, but I think we always had 0.1.
About inverting the images: For our training, the background is always 0 and the agent 1, we just switched it for the visualization. The difficulty here is the sparse gradient information, which in the past led to the agent information being lost, but Adam should solve that issue.
To make sure, that this really works, try to create a dataset which has an image for every possible position of the agent and sample your minibatches from that. This has to work for the VAE and you can use it to visualize the latent spaces. You might also try to compare your implementation to e.g Lasagne to find out where it goes wrong.
3. Correct, that's a mistake :/
4. Yes, we didn't specify anything here. We assume that the inference network handles all the noise.
Here's a VAE reconstruction of *just* the images:




Next steps:

  • Implement a way to visualize the latent space of trained models as a mapping of its "true" task space. I will need a "Variational Model" class of some sort to be able to abstract this.
  • Re-implement the VAE task with E2C-type modules.
  • I'm thinking of adding a logarithmic barrier to guard against the $v^Tr < -1$ condition (see the sketch below). Maybe that will help?
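
The cheapest version of this is probably just clamping the argument of the log away from zero, e.g. the TF 1.x-style sketch below; whether a proper barrier penalty is needed instead remains to be seen.

import tensorflow as tf

v = tf.placeholder(tf.float32, [None, 2])    # toy latent dimension
r = tf.placeholder(tf.float32, [None, 2])

vTr = tf.reduce_sum(v * r, axis=1)                    # v^T r per sample
log_det_term = tf.log(tf.maximum(1.0 + vTr, 1e-6))    # keep log(1 + v^T r) finite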

Tuesday, March 29, 2016

Project Update

My E2C implementation is having some issues that I've been trying to debug over the last few days. Basically the training process is unstable (I'm using the same random seed).


  • Part of the ratio of determinants in the computation of the KL divergence involving $N(\mu_0,A\Sigma_0A^T)$ includes a $\log(1+v^Tr)$ term. $vr^T$ is the perturbation of an identity matrix to create $A$, which is used in the linearized dynamics of the robot. Vectors $v$ and $r$ are generated using an affine transformation of $h_\psi^\text{trans}$, which implies that $v^Tr$ could be less than -1, in which case the log will underflow. I'm not quite sure how to get around this. The Torch implementation of E2C seems to compute the ratio of determinants via torch.log(q.sigma):sum() - torch.log(p.sigma):sum(), but that assumes A to be diagonal. Perhaps it's okay to assume A is diagonally-dominant, since $vr^T$ is supposed to be a small perturbation?
  • My plane dataset implementation generates an independent set of obstacles for each trajectory, but this is incorrect. z_dim=2, which intuitively is "just enough" to encode the position of the robot. But we are asking the autoencoder to learn how to represent the obstacle positions as well. The obvious solution is to modify the task so that the obstacle state is fixed across all samples, and is "baked" into the network weights of the autoencoder during training.
  • Originally I had the robot as a black square on a white background (obstacles black). It turns out that a white robot and obstacles on a black background work better.
  • Using GradientDescent instead of Adam prevented blowup (not sure if it has to do with the Adam $\beta_1$, $\beta_2$ parameters)
  • E2C is too complicated to debug, and I think there are issues with my KL divergence term implementation. To simplify things, I implemented a simple VAE. It works just fine on the MNIST task (naturally, with z=2 the network doesn't seem to be able to learn efficient encodings for all 10 digits), but fails to accurately encode the position of the robot. I tried downscaling the 40x40 plane task to 28x28 (same size as MNIST) but performance is still bad.
  • L_z term is blowing up even in VAE.
  • Perhaps the VAE is not learning the plane task properly because the weights are not convolutional and the spatial semantics are lost? I will try a robot size of 1 pixel on the blank plane dataset and see if that learns properly.

Tuesday, March 22, 2016

Week 3 Project Update

Stuff I did this week:

  • I've implemented the E2C model and the (synthetic) dataset for the 2D plane navigation task. There are some bugs in the code that I need to work out, but I have a pretty good grasp on the paper now.
  • I've started implementing a couple stochastic optimal control algorithms (iLQG, AICO) from this paper. I've written them in MATLAB but will translate to Python later for actual simulation.
  • I have a better formulation for the project I am trying to tackle, which I describe below:
In the Muscle-based Locomotion Paper, the robot is controlled by determining robot joint angles (corresponding to locomotive pose), then using a modified Jacobian-transpose controller to determine muscle activations that approximately effect the corresponding joint torques. This might be suitable for generating walking animations, but is not applicable to real robotics for the following reasons:
  • The controller is not model-free: i.e. we need to explicitly model rigid-body dynamics before we can solve for muscle activations. If our robot has deforming parts (like an octopus or even a squishy backpack), this gets difficult.
  • Robots are not able to accurately measure their localization, pose, and other state variables precisely, which makes planning difficult.

For my project, I propose Model-Free Stochastic Optimal Control for Muscle-Based Robots.

The construction is as follows:

  1. Given a fixed robot body plan (muscle routing) and environment (same as task environment), generate a dataset $D$ of initial state observations (images), random action by the robot, and resulting state transition (another image).
  2. Train E2C on $D$ to learn dynamics of how the muscle system interacts with the surrounding environment.
  3. Use a SOC algorithm to perform optimal control (moving from one side of the floor to the other). The robot does not have proprioceptive access to its "true" state; planning only utilizes an inferred state from its observations of the environment.

This is useful because we can use this framework to control soft robots in a model-free manner, letting the robot learn its own body dynamics implicitly.

I hope to finish this in the next two weeks (with the majority of implementation taking place over Spring Break). The remainder of the semester will be examining how to scale up the model to a more complex robot (bipedal walker, 3D, end-to-end refinement), as well as examining the topological latent space formed by E2C.

E2C Notes Part II

I hope to get the E2C model working on the toy 2D plane problem and inverted pendulum task by the end of this weekend.

Here's what I learned from reading the E2C paper a second time, this time thinking more carefully about implementation details. Please excuse typos in this post - I didn't proofread super carefully.

Problem Formulation for E2C Latent-Space Generative Model

The E2C model's job is: "given the current state observation $x_t$ of the system, predict the future state observation $x_{t+1}$ if the robot were to apply controls $u_t$."

Therefore, each training sample is a tuple of the form $(x_t,u_t,x_{t+1}) \in D$. $x_{t+1}$ is the "label" that we use to supervise learning of $Q_\psi(\hat{Z}|Z,u)$.

If each observation $x_t$ is an image, then the robot doesn't have a way of inferring the velocity of moving objects, which is pretty important in extrapolating what happens in the future.

In the paper, the authors fix this by bundling $x_{t-1}$ with $x_t$ so the robot has the opportunity to extract velocity information. In the paper, each training sample tuple is actually formulated as:
$$(x_{t-1},x_t,u_t,x_{t+1})$$

$[x_{t-1},x_t]$ represents some system state that is in motion (some unknown control $u_{t-1}$ was applied to $x_{t-1}$, taking the system to $x_t$). $u_t$ is the "proposed control", and $x_{t+1}$ is again the supervised label. Note that since this is a Markov decision process, we need not encode $u_{t-1}$: the future $x_{t+1}$ only depends on $x_t,u_t$.

Ideas:

Instead of concatenating the previous image and letting inference learn the motion relationship, why not just compute the (optical) flow ($x_t - x_{t-1}$) and pass that in as a feature? Maybe that makes velocity / collision planning easier to learn?

In the mammalian nervous system, rods and cones do not communicate with the brain in a synchronous manner. Instead, portions of the image stream at time $t$ are perceived with a slight lag, and may not end up at the "prediction" layer until a later time $t+\Delta{t}$. Perhaps the brain can take advantage of this "sampling across the time domain" asynchrony in order to infer optical flow.

We might make E2C do this by making the model recurrent for a short time sequence, like the DRAW paper. E2C operates over a sequence of inputs, and may learn to integrate / memorize an early input $x_0$ in order to compute an optical flow at $t=1$ using $x_1$ and $x_0$.

It'd be really cool to combine E2C with attention / temporal integration capabilities.

Step 1: constructing training dataset $D$

The training set $D$ is pre-computed (via a physics simulator) and then used to teach E2C how to predict future latent states of the system. Note that training $E2C$ involves no trajectory optimization, physics simulation, or robotic control -- just learning "precognition".


Note that we can reduce the disk size of our training dataset by sampling $(x_{t-1},x_t,u_t,x_{t+1})$ from randomly chosen intervals in continuous trajectories. The $x_{t+1}$ in one datapoint can be re-used as the $x_{t-1}$ or $x_t$ in another sample.

So the dataset is computed by initializing a reasonable starting state, then applying a bunch of random controls to it and letting the robot/system "flop around".
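
A sketch of what I mean by re-using frames across tuples, assuming each trajectory is stored as arrays of observations and controls (function and argument names are made up):

import numpy as np

def sample_tuples(x_traj, u_traj, n_samples, rng=np.random):
    # x_traj[t] is the observation at step t; u_traj[t] is the control applied at step t
    T = len(x_traj)
    ts = rng.randint(1, T - 1, size=n_samples)   # need both t-1 and t+1 to exist
    return [(x_traj[t - 1], x_traj[t], u_traj[t], x_traj[t + 1]) for t in ts]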

I sent an email to Manuel Watter and he confirmed that this is indeed the case.

I can think of a couple issues though:

If we sample $u \in U$ uniformly, we need to have "good coverage" over the range of all possible controls we would want the robot to be able to predict behavior over. If we want the network to be able to accurately predict the future for arbitrary control sampled from the full movement space $U$, then the training data necessarily grows exponentially with the degrees of freedom in the robot. In fact, it really grows with all pairwise combinations of state and control: $S \times U$, which is really depressing [1].

There's another large problem, which is the case in which the system is not ergodic. If gravity or entropy are involved in our dynamical system, this is very likely to be the case: the robot could fall down a gorge and any control $u$ in that state will result in the same future state: "stuck". Watter agrees with my hypothesis, but they haven't tried this.
Sampling a bunch of different initial states helps, but there's still the problem that a large space of data tuples consist of "stuck" predictions. If our dataset is even slightly biased towards being "stuck", then our robot will tend to predict "stuck" states.

Step 2: Training E2C on D

If training is successful (that's a big IF), E2C learns the Markov transition function of the latent space dynamics. Now we can hallucinate trajectories without actually simulating dynamics in the simulator. Given $x_0$, we can predict $z_1,...,z_T$, as well as the decoded images $x_1,x_2,...,x_T$.

One thing I'm still trying to figure out is how to compute $H_t$, the covariance that represents "system noise". The paper mentions it a couple times but doesn't give a clear definition. Is that a free parameter that we choose?

Step 3: Control

This is where the trained E2C is used to hallucinate state trajectories for use in stochastic optimal control. I'm currently working through the AICO paper.


[1] Humans have the ability to predict/navigate dynamical states $s$ and observations $x$ that they have never seen before. I imagine that an adult human's "training data" consists of a pretty comprehensive sampling of the space $U$ (available motions), but a very limited sampling of the space $X$.

If humans were using E2C to perform control, and were able to predict the future for a given $x \in X$, then that would imply that although $x$ may be novel, the human has actually seen a very similar (or exact) latent representation $z = m(x)$ before ("same problem, different wording"). The learned transition functions for certain latent variables, like our intuition of "object permanence", "gravity", and "bipedal walking", seem to be extremely well-developed, so although we may not have been in a particular scenario, we can still predict core pieces of the next observation using these "robust transition primitives".

Replicating this in E2C requires some big advances in unsupervised learning.
Concretely, we need to (1) learn some set of near-universal latent space dynamics that persist no matter what observation we are looking at, and (2) fill in "sparse information" using these "universal latent dynamics" and whatever available information we are partially familiar with.

Then, we somehow combine our sparse observation predictions to construct a dense observation prediction $x_{t+1}$, even though we never observed anything like $x_t$ before. This might solve our problem of exponential search space explosion.

[2] Random pondering: we can predict very high-dimensional systems (stock prices, all of our muscles, ...). Our prediction ability decreases as a function of "information distance" with respect to what we can perceive. Perhaps we not only learn a latent representation of world state, but also a latent representation of world state uncertainty.


Implementation Progress


I've constructed the dataset for the 2D plane navigation task. On the left is the starting state, on the right is the ending state after the robot (grey dot) has randomly moved one pixel to the left or right ($u$ value not shown).

I've also started implementation of the E2C model, and implemented basically everything except the loss nodes (the tricky part).

First Post: Project Proposal

The goals of this project are to:
  1. Implement muscle-based control of robots using techniques from Embed-to-Control.
  2. Examine the topological properties of the latent state space dynamics involved in robotic control.
I intend to draw ideas from the following papers:

Project Goals/Deliverables:

  • 2D simulation of a simple muscle-based worm with analysis of latent space dynamics found via E2C model.
  • Same as above, but with more complex robot (biped walker).
  • Generalization to 3D(reach target).

Possible Extensions:

  • Gallery of learned state spaces for various control problems / robot body plans.
  • Add spatial read/write attention to the model to reduce complexity burden that the transition network needs to learn (inspired by the DRAW paper).

Background (Robotics/RL Vocabulary and Problem Formulation)

I'll attempt to explain some of the vocabulary I learned from reading A Survey on Policy Search for Robotics. I should warn the reader that I'm learning this as I go, and my understanding of how these pieces fit together semantically may be incomplete / incorrect.

Robots operate in a state space $x$ that consists of internal states (joint angles, velocities, end effector position) as well as external states (e.g. the environment).

The robot chooses motor commands $u$ according to a control policy $\pi$. A Control Policy is a function $\pi$ that outputs an action $u$ given the state of the system $x_t$ at time $t$. $\pi$ can be stochastic ($\pi(u_t|x_t)$) or deterministic ($u=\pi(x_t)$).

The motor commands $u$ alter the state of the world (including the robot), so the state at the next discrete time step $t+1$ is thought of as being drawn from a probability distribution $p(x_{t+1}|x_t,u_t)$.

The joint states and actions of the robot across time form a trajectory $\tau=(x_1,u_1,...,x_T,u_T)$. For a stochastic policy, the trajectory can be formulated as a joint distribution over $(x_1,u_1,...,x_T,u_T)$.

Traditional Reinforcement Learning (which seems to be the dominant framework for thinking about robotic control) computes the expected "final reward" of a policy for each state $x$ and time $t$ via a value function $V_t^\pi(x)$. For some candidate action $u_t$, we can compute what state $x_{t+1}$ it leads to, and thus determine the reward if we perform $u_t$. Value function approximation suffers from some challenges that are discussed in Deisenroth et al.

In contrast to value-based methods, Policy Search (PS) methods use parameterized policies $\pi_\theta$ and attempt to learn the parameters $\theta$. PS is generally preferred to value approximation when it comes to robotics.

One class of Policy Search methods is Trajectory Optimization, which defines some cost function $J$ over trajectories $\tau$ and attempts to minimize $J$ with $\tau$ as a free parameter.

Optimal Control seems to encompass Trajectory Optimization and stuff like "Linear Gaussian Controllers" and related mathematical theory. I haven't completely pinpointed its semantic meaning in relation to the other vocab words.

Paper Review: Embed-to-Control (E2C)

This paper is better understood if the reader first reads Auto-Encoding Variational Bayes, which is how $z$'s are sampled in this paper. Alternatively, my blog post on implementing DRAW has a gentle introduction on this.

Under a trajectory optimization (optimal control?) framework, we wish to solve the following optimization problem at time $t$:

$\DeclareMathOperator*{\argmin}{arg\,min}$
$$(x^*_{t:t+T}, u^*_{t:t+T}) = \argmin_{x,u}J(x_{t:t+T},u_{t:t+T})$$

That is, we want to determine the states $x_{t:t+T}$ and robot controls $u_{t:t+T}$ that minimize some penalty for the next $T$ discretized time steps.

How do we sample this trajectory space? $u_1$ and $x_1$ determine $x_2$ via the complex nonlinear dynamics between the robot and its environment. We cannot compute $x_2$ short of supplying $u_1$ and $x_1$ to a physics simulator: we could choose an initial condition $x_0$ and run our robot forwards in time for $T$ steps, then evaluate $J$. If we want to try a different trajectory, we run another simulation. But resetting the state back to $x_0$ is certainly not an option if our robot is running in the real world!

The basic idea of E2C is to endow the robot with the ability to "predict what will happen to a system" for a given trajectory. Basically, the complex nonlinear dynamics & physics governing a system are approximated via a neural network. We can hallucinate as many $\tau$'s as we want, which allows us to measure $J$ and gives us the ability to run stochastic optimal control (SOC) algorithms on it.

The idea of applying "future prediction" to SOC algorithms is not new, but the main contribution of this paper is as follows: rather than inferring a future observation $x_{t+1}$ from the current system observation $x$ (which might be high-dimensional), we encode $x_t$ into a low-dimensional latent state $z_t$, and then run our "prediction function" on $z_t$ to obtain the next predicted latent state $z_{t+1}$. This is then decoded back into a future observation $x_{t+1}$.

The key insight here is that state prediction in low-dimensional latent space $Z$ is easier to learn than prediction in a high-dimensional observation space $X$.
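
Schematically the pipeline is encode, transition in latent space, decode. Here is a stripped-down, deterministic TF 1.x-style sketch of that structure (single dense layers, a single global A/B/o; the real E2C model is variational and predicts A, B, o per state):

import tensorflow as tf

x_dim, z_dim, u_dim = 40 * 40, 2, 2                      # toy sizes for the plane task
x_t = tf.placeholder(tf.float32, [None, x_dim])
u_t = tf.placeholder(tf.float32, [None, u_dim])

z_t = tf.layers.dense(x_t, z_dim, name="encoder")        # m(x_t) -> z_t
z_next = (tf.layers.dense(z_t, z_dim, use_bias=False, name="A")   # A z_t
          + tf.layers.dense(u_t, z_dim, name="B"))                # + B u_t + o
x_next_pred = tf.layers.dense(z_next, x_dim, tf.nn.sigmoid, name="decoder")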

Here's a diagram I drew that clarifies the latent encoder mechanism (this portion is only responsible for hallucinating future states, which is then supplied as input to a SOC algorithm).




I hypothesize that the latent space $Z$ learned by the E2C model has some really interesting topologies that might give insight into the properties of the robot (is it moving in a low or high dimensional space? Is the trajectory distribution of the policy multimodal?)

Paper Review: Muscle-Based Locomotion

This is an animation paper, and does not require much background knowledge to understand. It basically uses a genetic algorithm to perform end-to-end optimization of muscle-routing and muscle activation parameters (strength, delay, etc.).

Torques at each joint are determined via Jacobian transpose control, then the joint torques are converted into muscle forces.

The reason why genetic algorithms are used in this paper is because a physics simulation is needed to measure the cost of a control policy (so the cost function is not trivially differentiable).

This paper works because of the inherent dynamical stability of muscle-based actuation during biped walking. Biped walking falls naturally out of the optimization process if optimized for trunk orientation, head position, and matching some intermediate poses. However, it's not obvious how to generalize muscle-based control to other tasks by exploiting bio-mechanical feedback rules.

I see Deep Learning as especially promising for robotic control in domains where the robot dynamics model is analytically intractable (unlike a rigid joint-based arm). Rather than trying to come up with a Jacobian-based control scheme for muscles, why don't we just train it on the target task, and let it figure out its own control scheme that compensates for its intrinsic biomechanics?

Furthermore, using E2C might reveal something interesting about the latent space of muscle-dynamics (do they live on the same manifold as joint-based locomotion? Most likely not...)

Questions and Ideas


$x_{\text{goal}}$ is a partial or complete observation of the system state that the robot is trying to realize (in the E2C paper, the examples suggest that it is a full 2D image of the robot sitting in its environment).

However, if the environment involves moving pieces (i.e. a tub of water sloshing around, falling obstacles) that are irrelevant to the goal state, our $x_{\text{goal}}$ is unlikely to contain a perfect depiction of the final state. This means that we need to encode $x$ in such a way that these irrelevant elements are ignored in $z_{\text{goal}}$.

Implementation Plan and Development Progress:


I'm starting this project fairly late into the semester, so I'll be working really hard on this in order to meet the ambitious goals of this project before I graduate in May.

For the 2D simulator, I successfully installed LiquidFun, which is a 2D Rigid-body and fluid simulator written in C++. Part of the reason I chose LiquidFun over Box2D was that Box2D has issues compiling on my laptop, and LiquidFun provides a superset of Box2D features anyway.

My goal by next week is to write a simple worm simulation and a manual controller to prove that locomotion is possible within the simulation framework I am using.

Note to self: turns out that linking the static library requires me to put -lliquidfun AFTER the -o main main.cpp invocation.

g++ -I/contrib/projects/liquidfun/liquidfun/Box2D -L/contrib/projects/liquidfun/liquidfun/Box2D/Box2D/Release -o main main.cpp -lliquidfun

 Someone has already implemented E2C in Torch7, but in a language I'm not proficient in writing/debugging. I will translate to Python + TensorFlow.

In other words: "I like wood better than steel, so I'll build this bridge out of wood".


Additional Reading I'm Looking Into: