Thursday, March 31, 2016

On Discriminative Regularization

The paper "Discriminative Regularization for Generative Models" makes an important insight - we can use the output of a classifier as a regularization term within the objective function for training generative models.

Other places where this "multi-task regularization" comes up:

  • Learning Physical Intuition of Block Towers by Example - in addition to "predicting where the blocks will end up", the network is also asked to predict a binary "will the blocks fall?" In a hand-wavy sense, this acts as a way to "force" the internal representation to be useful for classifying an important "high-level" feature: namely, whether the blocks fell or not. As humans, we tend to be more interested in whether the blocks fell or not than the exact L2 distance between where the blocks fell and where we thought they would fall.

Indeed, the images produced by E2C (and DRAW, to some extent!) appear blurry. D.R. might be one way to sharpen them.



How might we apply D.R. to E2C?


Human perception is goal-driven. The objective term corresponding to the fidelity of latent-space encoding/decoding should not only penalize deviations in prediction (i.e. the binary CE between $x$ and $x_{\text{recons}}$) but also penalize deviations in some "goal-space":

Some ideas:
  • Binary variable of whether agent is in free fall or not (to borrow directly from Lerer et al. 2016). Only applies to situations with gravity I think.
  • Scalar "value function" to predict the cost of a random T=10 horizon trajectory. We can compute this only using a random sample of controls, so this is actually fairly easy to sample. Given $x_t$, $u_{t},u_{t+1},...,u_{t+T} \in U$ we hallucinate $\hat{X}_{t:t+T}=\{\hat{x}_{t+1},\hat{x}_{t+2},...,\hat{x}_{t+T}\}$. Then we feed $\hat{X}_{t:t+T}$ (T images) into some big convnet (not unlike the Atari-playing game) and use that to compute the cost. Meanwhile, we can run the simulator on the actual controls $u_{t},u_{t+1},...,u_{t+T} \in U$ and use that to obtain $X_{t:t+T}$ and the true costs $J$ of the local trajectory. Of course, we don't want to bias the representation towards a very specific cost function (i.e. a very specific goal state) so we could sample a variety of goal images.
  • Approximate the value $c(z_t,u_t)$ for a randomly sampled $z_{\text{goal}}$.
  • Binary variable - "is the agent touching an obstacle?".
I'm currently working my way through "One-Shot Generalization in Deep Generative Models", which has some really interesting ideas whose theoretical foundations I don't fully understand, but I feel it will give me a much better grasp of how attention mechanisms might help improve E2C performance.

E2C is working... sort of

Finally got E2C implementation working.



Here's the visualization of the latent space of the plane task (moving randomly around in a simple 2D maze) unfolding over time for the first few iterations.

For the first ~5000 iterations, the latent space is completely mixed, but the colors separate somewhat during training.

Unfortunately, the results do not look as pretty as those in the paper (where the latent space becomes nearly identical to the true space), but perhaps that is because my data points are less densely populated at the bottom of the image? I.e. the state space is not uniformly sampled (the trajectories sample more from the top region than the bottom one).
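
For reference, the plots above come from something like the following routine (a numpy/matplotlib sketch; encode stands in for the recognition network and is not the name of anything in my code): encode every observation and color the latent points by the agent's true position.

import matplotlib.pyplot as plt

def plot_latent_vs_true(encode, images, true_xy):
    # encode  : maps a batch of images (N, H*W) to latent means (N, 2)
    # true_xy : (N, 2) ground-truth agent positions in the maze
    z = encode(images)
    fig, (ax0, ax1) = plt.subplots(1, 2, figsize=(8, 4))
    # color both panels by the true x-coordinate so corresponding points match
    ax0.scatter(true_xy[:, 0], true_xy[:, 1], c=true_xy[:, 0], s=5)
    ax0.set_title("true task space")
    ax1.scatter(z[:, 0], z[:, 1], c=true_xy[:, 0], s=5)
    ax1.set_title("latent space")
    plt.show()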

The next steps:

  • Get iLQG control working with this agent. The task space hasn't properly unfolded, which is kind of concerning. 
  • Resume work on Box2D + LiquidFun implementation of muscle robot.
A rather daunting difficulty right now is the static nature of the observations. The environment isn't moving, and I'm uncertain how "accurate" the latent manifold will be in the presence of visual noise.

For example, can I get the worm to cross a moving platform across a chasm?

Wednesday, March 30, 2016

Reparameterization trick for Q_psi

Killed 2 bugs with one stone - turns out that I don't need to compute $C_t=A\Sigma A^T+H$ in order to generate a sample from it!

To sample $z_{t+1} \sim N(A\mu+Bu+o, A\Sigma A^T+H)$, let $\omega\sim N(0,H)$ and $x \sim N(0,\Sigma)$. $\Sigma$ is diagonal so we can sample $x$ using the ordinary reparameterization trick outlined in the VAE paper.

Let $y=Ax+\omega$.

Then we have
$$\mu_y=E[Ax+\omega]=E[Ax]+E[\omega]=A\cdot 0+0=0$$
$$\Sigma_y=\text{Var}[Ax+\omega]=\text{Var}[Ax]+\text{Var}[\omega]=A\Sigma A^T + H$$

Which yields us the exact covariance we want!

Then we can just compute $z_{t+1}=A\mu+Bu+o+y$. Note that we could have centered $x$ at $\mu$, in which case $\mu_y=A\mu$, to which we would just add $Bu+o$; but it turns out that we need to compute $\mu_{z_{t+1}}$ anyway for measuring the KL divergence later.
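
A minimal numpy sketch of this sampling step, assuming $\Sigma$ and $H$ are diagonal and that A, B, o, mu have already been produced by the transition network (function and argument names are just for illustration):

import numpy as np

def sample_z_next(A, B, o, mu, sigma_diag, h_diag, u, rng=np.random):
    # x ~ N(0, Sigma) via the ordinary VAE reparameterization (Sigma diagonal)
    x = np.sqrt(sigma_diag) * rng.randn(len(sigma_diag))
    # omega ~ N(0, H)
    omega = np.sqrt(h_diag) * rng.randn(len(h_diag))
    # y = A x + omega  ~  N(0, A Sigma A^T + H), without ever forming C_t
    y = A @ x + omega
    # we need the mean anyway for the KL term later
    mu_next = A @ mu + B @ u + o
    return mu_next + y, mu_next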

Latent Loss Term

When I train the model to just minimize L_x, and then add the L_z term later and re-train, the network learns fine. However, when I try to minimize both L_x+L_z right off the bat, the gradient blows up.

Fixed it - I had L_z implemented incorrectly (wrong sign somewhere).

Now it trains with both L_z + L_x.

Figured it out

Fixed the VAE bug. 

Turns out that my VAE model was correctly implemented, but during testing I was forgetting to invert the images I was passing in (which is what I was doing during training). Of course, this messes everything up. The reconstruction losses I was obtaining were actually converging very close to 0, which was an accurate reflection of what was going on during training.

The issue was my sloppy programming skills. I could have avoided this mistake (and saved a couple days of grief) by simplifying the data interface to avoid having to manually process it after calling sample().

Sent an email to Manuel Watter asking about the KL divergence business and some other questions. Here's what he says:
1.  You're right, the KL can explode. Ideally, the network should stay away from the problematic values for v and r, but that's of course not guaranteed. John Assael's approximation is only valid if, as you said, there is only insignificant covariance between the state dimensions, i.e. small perturbations of v and r. This becomes problematic for example in the case of velocities, because they obviously influence the position. The network might choose a state representation that minimizes this influence then, but it's hard to say.
2. Both of your tests should work here and I think the MNIST reconstructions look not too bad. In 10 dimensions, it really can't do more than represent an average version of the numbers, and since there's no label information, some of those get blurred together. For the maze on the other hand, it should be able to give almost perfect reconstructions, or at least get the position of the agent right. From your settings, the minibatch size seems a little odd. 1000 is quite a lot; try 128 or 64. I'm not sure about the beta1 value of Adam, but I think we always had 0.1.
About inverting the images: For our training, the background is always 0 and the agent 1, we just switched it for the visualization. The difficulty here is the sparse gradient information, which in the past led to the agent information being lost, but Adam should solve that issue.
To make sure that this really works, try to create a dataset which has an image for every possible position of the agent and sample your minibatches from that. This has to work for the VAE and you can use it to visualize the latent spaces. You might also try to compare your implementation to e.g. Lasagne to find out where it goes wrong.
3. Correct, that's a mistake :/
4. Yes, we didn't specify anything here. We assume that the inference network handles all the noise.
Here's a VAE reconstruction of *just* the images:




Next steps:

  • Implement a way to visualize the latent space of trained models as a mapping of its "true" task space. I will need a "Variational Model" class of some sort to be able to abstract this.
  • Re-implement the VAE task with E2C-type modules.
  • I'm thinking of adding a logarithmic barrier to guard against the $v^Tr < -1$ condition. Maybe that will help?

Tuesday, March 29, 2016

Project Update

My E2C implementation is having some issues that I've been trying to debug over the last few days. Basically the training process is unstable (I'm using the same random seed).


  • Part of the ratio of determinants in the computation of the KL divergence involving $N(\mu_0,A\Sigma_0A^T)$ includes a $\log(1+v^Tr)$ term. $vr^T$ is the perturbation of an identity matrix used to create $A$, which appears in the linearized dynamics of the robot. Vectors $v$ and $r$ are generated using an affine transformation of $h_\psi^\text{trans}$, which implies that $v^Tr$ could be less than -1, in which case the log is undefined. I'm not quite sure how to get around this (one possible guard is sketched after this list). The Torch implementation of E2C seems to compute the ratio of determinants via torch.log(q.sigma):sum() - torch.log(p.sigma):sum(), but that assumes A to be diagonal. Perhaps it's okay to assume A is diagonally dominant, since $vr^T$ is supposed to be a small perturbation?
  • My plane dataset implementation generates an independent set of obstacles for each trajectory, but this is incorrect. z_dim=2, which intuitively is "just enough" to encode the position of the robot. But we are asking the autoencoder to learn how to represent the obstacle positions as well. The obvious solution is to modify the task so that the obstacle state is fixed across all samples, and is "baked" into the network weights of the autoencoder during training.
  • Originally I had the robot as a black square on a white background (obstacles black). It turns out that a white robot and obstacles on a black background work better.
  • Using GradientDescent instead of Adam prevented blowup (not sure if it has to do with the Adam $\beta_1$, $\beta_2$ parameters)
  • E2C is too complicated to debug, and I think there are issues with my KL divergence term implementation. To simplify things, I implemented a simple VAE. It works just fine on the MNIST task (naturally, with z_dim=2 the network doesn't seem to be able to learn efficient encodings for all 10 digits), but fails to accurately encode the position of the robot. I tried downscaling the 40x40 plane task to 28x28 (same size as MNIST) but performance is still bad.
  • L_z term is blowing up even in VAE.
  • Perhaps the VAE is not learning the plane task properly because the weights are not convolutional and the spatial semantics are lost? I will try a robot size of 1 pixel on the blank plane dataset and see if that learns properly.
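
Here's the sketch referred to in the first bullet: the determinant of $A=I+vr^T$ collapses to $1+r^Tv$ via the matrix determinant lemma, and the clamp below is just an ad-hoc guard I'm considering (not something from the paper or the Torch implementation):

import numpy as np

def logdet_A(v, r, eps=1e-6):
    # det(I + v r^T) = 1 + r^T v, so log|det A| = log(1 + r^T v)
    s = 1.0 + np.dot(r, v)
    # if the network drives r^T v below -1, the log is undefined;
    # clamp as a crude guard (alternatives: log-barrier, softplus parameterization)
    s = np.maximum(s, eps)
    return np.log(s)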

Tuesday, March 22, 2016

Week 3 Project Update

Stuff I did this week:

  • I've implemented the E2C model and the (synthetic) dataset for the 2D plane navigation task. There are some bugs in the code that I need to work out, but I have a pretty good grasp on the paper now.
  • I've started implementing a couple stochastic optimal control algorithms (iLQG, AICO) from this paper. I've written them in MATLAB but will translate to Python later for actual simulation.
  • I have a better formulation for the project I am trying to tackle, which I describe below:
In the Muscle-based Locomotion Paper, the robot is controlled by determining robot joint angles (corresponding to locomotive pose), then using a modified Jacobian-transpose controller to determine muscle activations that approximately effect the corresponding joint torques. This might be suitable for generating walking animations, but is not applicable to real robotics for the following reasons:
  • The controller is not model-free: i.e., we need to explicitly model rigid-body dynamics before we can solve for muscle activations. If our robot has deforming parts (like an octopus or even a squishy backpack), this gets difficult.
  • Robots are not able to measure their localization, pose, and other state variables precisely, which makes planning difficult.

For my project, I propose Model-Free Stochastic Optimal Control for Muscle-Based Robots.

The construction is as follows:

  1. Given a fixed robot body plan (muscle routing) and environment (the same as the task environment), generate a dataset $D$ of initial state observations (images), random actions by the robot, and the resulting state transitions (more images).
  2. Train E2C on $D$ to learn dynamics of how the muscle system interacts with the surrounding environment.
  3. Use a SOC algorithm to perform optimal control (moving from one side of the floor to the other). The robot does not have proprioceptive access to its "true" state; planning only utilizes an inferred state from its observations of the environment.

This is useful because we can use this framework to control soft robots in a model-free manner, letting the robot learn its own body dynamics implicitly.

I hope to finish this in the next two weeks (with the majority of implementation taking place over Spring Break). The remainder of the semester will be examining how to scale up the model to a more complex robot (bipedal walker, 3D, end-to-end refinement), as well as examining the topological latent space formed by E2C.

E2C Notes Part II

I hope to get the E2C model working on the toy 2D plane problem and inverted pendulum task by the end of this weekend.

Here's what I learned from reading the E2C paper a second time, this time thinking more carefully about implementation details. Please excuse typos in this post - I didn't proofread super carefully.

Problem Formulation for E2C Latent-Space Generative Model

The E2C model's job is: "given the previous system state observation $x_t$, predict the future system state observation $x_{t+1}$ if the robot were to apply controls $u_t$."

Therefore, each training sample is a tuple of the form $(x_t,u_t,x_{t+1}) \in D$. $x_{t+1}$ is the "label" that we use to supervise learning of $Q_\psi(\hat{Z}|Z,u)$.

If each observation $x_t$ is an image, then the robot doesn't have a way of inferring the velocity of moving objects, which is pretty important in extrapolating what happens in the future.

In the paper, the authors fix this by bundling $x_{t-1}$ with $x_t$ so the robot has the opportunity to extract velocity information. In the paper, each training sample tuple is actually formulated as:
$$(x_{t-1},x_t,u_t,x_{t+1})$$

$[x_{t-1},x_t]$ represents a system state that is in motion (some unknown control $u_{t-1}$ was applied to $x_{t-1}$, taking the system to $x_t$). $u_t$ is the "proposed control", and $x_{t+1}$ is the supervised label as before. Note that we need not encode $u_{t-1}$: since we treat the system dynamics as a Markov process, the future $x_{t+1}$ depends only on $x_t$ and $u_t$.

Ideas:

Instead of concatenating the previous image and letting inference learn the motion relationship, why not just compute the (optical) flow ($x_t - x_{t-1}$) and pass that in as a feature? Maybe that makes velocity / collision planning easier to learn?

In the mammalian nervous system, rods and cones do not communicate with the brain in a synchronous manner. Instead, portions of the image stream at time $t$ are perceived with a slight lag, and may not end up at the "prediction" layer until a later time $t+\Delta{t}$. Perhaps the brain takes advantage of this "sampling across the time domain" asynchrony in order to infer optical flow.

We might make E2C do this by making the model recurrent over a short time sequence, like the DRAW paper. E2C would operate over a sequence of inputs, and may learn to integrate / memorize an early input $x_0$ in order to compute an optical flow at $t=1$ using $x_1$ and $x_0$.

It'd be really cool to combine E2C with attention / temporal integration capabilities.

Step 1: constructing training dataset $D$

The training set $D$ is pre-computed (via a physics simulator) and then used to teach E2C how to predict future latent states of the system. Note that training $E2C$ involves no trajectory optimization, physics simulation, or robotic control -- just learning "precognition".


Note that we can reduce the disk size of our training dataset by sampling $(x_{t-1},x_t,u_t,x_{t+1})$ from randomly chosen intervals in continuous trajectories. The $x_{t+1}$ in one datapoint can be re-used as the $x_{t-1}$ or $x_t$ in another sample.
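
Concretely, the sampling could look like this (a numpy sketch over a single stored trajectory; the real dataset code would also record obstacle layout, rendering, etc.):

import numpy as np

def sample_training_tuples(images, controls, n_samples, rng=np.random):
    # images   : (T, H*W) observations along one continuous trajectory
    # controls : (T, u_dim) controls, controls[t] applied at images[t]
    T = len(images)
    tuples = []
    for _ in range(n_samples):
        t = rng.randint(1, T - 1)  # need both t-1 and t+1 to exist
        tuples.append((images[t - 1], images[t], controls[t], images[t + 1]))
    return tuples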

So the dataset is computed by initializing a reasonable starting state, then applying a bunch of random controls to it and letting the robot/system "flop around".

I sent an email to Manuel Watter and he confirmed that this is indeed the case.

I can think of a couple issues though:

If we sample $u \in U$ uniformly, we need to have "good coverage" over the range of all possible controls we would want the robot to be able to predict behavior over. If we want the network to be able to accurately predict the future for arbitrary control sampled from the full movement space $U$, then the training data necessarily grows exponentially with the degrees of freedom in the robot. In fact, it really grows with all pairwise combinations of state and control: $S \times U$, which is really depressing [1].

There's another large problem, which is the case in which the system is not ergodic. If gravity or entropy are involved in our dynamical system, this is very likely to be the case: the robot could fall down a gorge and any control $u$ in that state will result in the same future state: "stuck". Watter agrees with my hypothesis, but they haven't tried this.
Sampling a bunch of different initial states helps, but there's still the problem that a large portion of the data tuples consists of "stuck" predictions. If our dataset is even slightly biased towards being "stuck", then our robot will tend to predict "stuck" states.

Step 2: Training E2C on D

If training is successful (that's a big IF), E2C learns the Markov transition function of the latent-space dynamics. Now we can hallucinate trajectories without actually running the simulator. Given $x_0$, we can predict $z_1,...,z_T$, as well as the decoded images $\hat{x}_1,\hat{x}_2,...,\hat{x}_T$.
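
The hallucination loop itself is tiny; here is a sketch (encode, transition, and decode stand in for the trained inference, latent dynamics, and generative networks, and are not names from my code):

def hallucinate(encode, transition, decode, x0, controls):
    # encode     : x -> z          (use the mean of the recognition model)
    # transition : (z, u) -> z'    (learned latent dynamics)
    # decode     : z -> x_hat      (generative network)
    z = encode(x0)
    zs, xs = [], []
    for u in controls:
        z = transition(z, u)   # predicted next latent state
        zs.append(z)
        xs.append(decode(z))   # decoded "precognition" of the observation
    return zs, xs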

One thing I'm still trying to figure out is how to compute $H_t$, the covariance that represents "system noise". The paper mentions it a couple times but doesn't give a clear definition. Is that a free parameter that we choose?

Step 3: Control

This is where the trained E2C is used to hallucinate state trajectories for use in a stochastic optimal control algorithm. I'm currently working through the AICO paper.


[1] Humans have the ability to predict/navigate dynamical states $s$ and observations $x$ that they have never seen before. I imagine that an adult human's "training data" consists of a pretty comprehensive sampling of the space $U$ (available motions), but a very limited sampling of the space $X$.

If humans were using E2C to perform control, and are able to predict the future for a given $x \in X$, then that would imply that although $x$ may be novel, the human has actually seen a very similar (or identical) latent representation $z = m(x)$ before ("same problem, different wording"). The learned transition functions for certain latent variables, like our intuitions of "object permanence", "gravity", and "bipedal walking", seem to be extremely well-developed, so although we may not have been in a particular scenario, we can still predict core pieces of the next observation using these "robust transition primitives".

Replicating this in E2C requires some big advances in unsupervised learning.
Concretely, we need to (1) learn some set of near-universal latent space dynamics that persist no matter what observation we are looking at, and (2) fill in "sparse information" using these "universal latent dynamics" and whatever available information we are partially familiar with.

Then, we somehow combine our sparse observation predictions to construct a dense observation prediction $x_{t+1}$, even though we never observed anything like $x_t$ before. This might solve our problem of exponential search space explosion.

 [2] Random pondering: we can predict very high-dimensional systems (stock prices, all of our muscles, ...). Our prediction abilities decrease as a function of "information distance" with respect to what we can perceive. Perhaps we not only learn a latent representation of world state, but also a latent representation of world state uncertainty.


Implementation Progress


I've constructed the dataset for the 2D plane navigation task. On the left is the starting state, on the right is the ending state after the robot (grey dot) has randomly moved one pixel to the left or right ($u$ value not shown).
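
The generator for this toy dataset is only a few lines; here is a stripped-down sketch (my actual version also renders obstacles and handles the image borders):

import numpy as np

def make_plane_sample(side=40, rng=np.random):
    # the "robot" is a single bright pixel that moves one step left or right
    row = rng.randint(0, side)
    col = rng.randint(1, side - 1)   # leave room to move either direction
    u = rng.choice([-1, 1])          # control: one pixel left or right

    x_t = np.zeros((side, side), dtype=np.float32)
    x_t[row, col] = 1.0
    x_next = np.zeros((side, side), dtype=np.float32)
    x_next[row, col + u] = 1.0
    return x_t.ravel(), np.array([u], dtype=np.float32), x_next.ravel()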

I've also started implementation of the E2C model, and implemented basically everything except the loss nodes (the tricky part).

First Post: Project Proposal

The goals of this project are to:
  1. Implement muscle-based control of robots using techniques from Embed-to-Control.
  2. Examine the topological properties of the latent state space dynamics involved in robotic control.
I intend to draw ideas from the following papers:

Project Goals/Deliverables:

  • 2D simulation of a simple muscle-based worm with analysis of latent space dynamics found via E2C model.
  • Same as above, but with more complex robot (biped walker).
  • Generalization to 3D (reach target).

Possible Extensions:

  • Gallery of learned state spaces for various control problems / robot body plans.
  • Add spatial read/write attention to the model to reduce complexity burden that the transition network needs to learn (inspired by the DRAW paper).

Background (Robotics/RL Vocabulary and Problem Formulation)

I'll attempt to explain some of the vocabulary I learned from reading A Survey on Policy Search for Robotics. I should warn the reader that I'm learning this as I go, and my understanding of how these pieces fit together semantically may be incomplete / incorrect.

Robots operate in a state space $x$ that consists of internal states (joint angles, velocities, end effector position) as well as external states (e.g. the environment).

The robot chooses motor commands $u$ according to a control policy $\pi$. A Control Policy is a function $\pi$ that outputs action $u$ given the state of system $x_t$ at time $t$. $\pi$ can be stochastic ($\pi(u_t|x_t)$) or deterministic $(u=\pi(x_t))$

The motor commands $u$ alter the state of the world (including the robot), so the state at the next discrete time step $t+1$ is thought of as being drawn from a probability distribution $p(x_{t+1}|x_t,u_t)$.

The joint states and actions of the robot across time form a trajectory $\tau=(x_1,u_1,...,x_T,u_T)$. For a stochastic policy, the trajectory can be formulated as a joint distribution over $(x_1,u_1,...,x_T,u_T)$.

Traditional Reinforcement Learning (which seems to be the dominant framework for thinking about robotic control) computes the expected "final reward" of a policy for each state $x$ and time $t$ via a value function $V_t^\pi(x)$. For some candidate action $u_t$, we can compute what state $x_{t+1}$ it effects, and thus determine the reward of performing $u_t$. Value function approximation suffers from some challenges that are discussed in Deisenroth et al.

In contrast to value-based methods, Policy Search (PS) methods use parameterized policies $\pi_\theta$ and attempt to learn the parameters $\theta$. PS is generally preferred to value approximation when it comes to robotics.

One class of Policy Search methods is Trajectory Optimization, which defines some cost function $J$ over trajectories $\tau$ and attempts to minimize $J$ with $\tau$ as a free parameter.

Optimal Control seems to encompass Trajectory Optimization and stuff like "Linear Gaussian Controllers" and related mathematical theory. I haven't completely pinpointed its semantic meaning in relation to the other vocab words.

Paper Review: Embed-to-Control (E2C)

This paper is better understood if the reader first reads Auto-Encoding Variational Bayes, which is how $z$'s are sampled in this paper. Alternatively, my blog post on implementing DRAW has a gentle introduction to this.

Under a trajectory optimization (optimal control?) framework, we wish to solve the following optimization problem at time $t$:

$\DeclareMathOperator*{\argmin}{arg\,min}$
$$(x^*_{t:t+T}, u^*_{t:t+T}) = \argmin_{x,u}J(x_{t:t+T},u_{t:t+T})$$

That is, we want to determine the states $x_{t:t+T}$ and robot controls $u_{t:t+T}$ that minimize some penalty for the next $T$ discretized time steps.

How do we sample this trajectory space? $u_1$ and $x_1$ determine $x_2$ via the complex nonlinear dynamics between the robot and its environment. We cannot compute $x_2$ short of supplying $u_1$ and $x_1$ to a physics simulator: we could choose an initial condition $x_0$ and run our robot forwards in time for $T$ steps, then evaluate $J$. If we want to try a different trajectory, we run another simulation. But resetting the state back to $x_0$ is certainly not an option if our robot is running in the real world!

The basic idea of E2C is to endow the robot with the ability to "predict what will happen to a system" for a given trajectory. Basically, the complex nonlinear dynamics & physics governing a system are approximated via a neural network. We can hallucinate as many $\tau$'s as we want, which allows us to measure $J$, which gives us the ability to run stochastic optimal control (SOC) algorithms on it.
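
As a toy illustration of how hallucinated trajectories plug into control, here is a naive random-shooting planner, not the iLQG/AICO machinery actually used; encode, transition, and cost are placeholders for the trained networks and a latent-space cost (e.g. squared distance to an encoded goal):

import numpy as np

def random_shooting(encode, transition, cost, x_t, z_goal, T=10,
                    n_candidates=100, u_dim=2, rng=np.random):
    # sample many candidate control sequences, roll them out in latent space
    # with the learned model, and keep the one with the lowest total cost J
    z0 = encode(x_t)
    best_J, best_us = np.inf, None
    for _ in range(n_candidates):
        us = rng.uniform(-1.0, 1.0, size=(T, u_dim))
        z, J = z0, 0.0
        for u in us:
            z = transition(z, u)
            J += cost(z, z_goal)
        if J < best_J:
            best_J, best_us = J, us
    return best_us, best_J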

The idea of applying "future prediction" to SOC algorithms is not new, but the main contribution of this paper is as follows: rather than inferring a future observation $x_{t+1}$ from the current system observation $x_t$ (which might be high-dimensional), we encode $x_t$ into a low-dimensional latent state $z_t$, and then run our "prediction function" on $z_t$ to obtain the next predicted latent state $z_{t+1}$. This is then decoded back into a future observation $x_{t+1}$.

The key insight here is that state prediction in low-dimensional latent space $Z$ is easier to learn than prediction in a high-dimensional observation space $X$.

Here's a diagram I drew that clarifies the latent encoder mechanism (this portion is only responsible for hallucinating future states, which is then supplied as input to a SOC algorithm).




I hypothesize that the latent space $Z$ learned by the E2C model has some really interesting topologies that might give insight into the properties of the robot (is it moving in a low or high dimensional space? Is the trajectory distribution of the policy multimodal?)

Paper Review: Muscle-Based Locomotion

This is an animation paper, and does not require much background knowledge to understand. It basically uses a genetic algorithm to perform end-to-end optimization of muscle-routing and muscle activation parameters (strength, delay, etc.).

Torques at each joint are determined via Jacobian transpose control, then the joint torques are converted into muscle forces.

The reason genetic algorithms are used in this paper is that a physics simulation is needed to measure the cost of a control policy (so the cost function is not trivially differentiable).

This paper works because of the inherent dynamical stability of muscle-based actuation during biped walking. Biped walking falls naturally out of the optimization process if optimized for trunk orientation, head position, and matching some intermediate poses. However, it's not obvious how to generalize muscle-based control to other tasks by exploiting bio-mechanical feedback rules.

I see Deep Learning as especially promising for robotic control in domains where the robot dynamics model is analytically intractable (unlike a rigid joint-based arm). Rather than trying to come up with a Jacobian-based control scheme for muscle, why don't we just train it on the target task, and let it figure out its own control scheme that compensates for its intrinsic biomechanics?

Furthermore, using E2C might reveal something interesting about the latent space of muscle-dynamics (do they live on the same manifold as joint-based locomotion? Most likely not...)

Questions and Ideas


$x_{\text{goal}}$ is a partial or complete observation of the system state that the robot is trying to realize (in the E2C paper, the examples suggest that it is a full 2D image of the robot sitting in its environment).

However, if the environment involves moving pieces (i.e. a tub of water sloshing around, falling obstacles) that are irrelevant to the goal state, our $x_{\text{goal}}$ is unlikely to contain a perfect depiction of the final state. This means that we need to encode $x$ in such a way that these are ignored in $z_{\text{goal}}$.

Implementation Plan and Development Progress:


I'm starting this project fairly late into the semester, so I'll be working really hard on this in order to meet the ambitious goals of this project before I graduate in May.

For the 2D simulator, I successfully installed LiquidFun, which is a 2D rigid-body and fluid simulator written in C++. Part of the reason I chose LiquidFun over Box2D was that Box2D has issues compiling on my laptop, and LiquidFun provides a superset of Box2D features anyway.

My goal by next week is to write a simple worm simulation and a manual controller to prove that locomotion is possible within the simulation framework I am using.

Note to self: turns out that linking the static library requires me to put -lliquidfun AFTER the -o main main.cpp invocation.

g++ -I/contrib/projects/liquidfun/liquidfun/Box2D -L/contrib/projects/liquidfun/liquidfun/Box2D/Box2D/Release -o main main.cpp -lliquidfun

 Someone has already implemented E2C in Torch7, but in a language I'm not proficient in writing/debugging. I will translate to Python + TensorFlow.

In other words: "I like wood better than steel, so I'll build this bridge out of wood".


Additional Reading I'm Looking Into: