Tuesday, March 22, 2016

First Post: Project Proposal

The goals of this project are to:
  1. Implement muscle-based control of robots using techniques from Embed-to-Control.
  2. Examine the topological properties of the latent state space dynamics involved in robotic control.
I intend to draw ideas from the papers reviewed below.

Project Goals/Deliverables:

  • 2D simulation of a simple muscle-based worm with analysis of latent space dynamics found via E2C model.
  • Same as above, but with more complex robot (biped walker).
  • Generalization to 3D (reach a target).

Possible Extensions:

  • Gallery of learned state spaces for various control problems / robot body plans.
  • Add spatial read/write attention to the model to reduce the complexity that the transition network needs to learn (inspired by the DRAW paper).

Background (Robotics/RL Vocabulary and Problem Formulation)

I'll attempt to explain some of the vocabulary I learned from reading A Survey on Policy Search for Robotics. I should warn the reader that I'm learning this as I go, and my understanding of how these pieces fit together semantically may be incomplete / incorrect.

Robots operate in a state space $x$ that consists of internal states (joint angles, velocities, end effector position) as well as external states (e.g. the environment).

The robot chooses motor commands $u$ according to a control policy $\pi$. A control policy is a function $\pi$ that outputs an action $u_t$ given the state of the system $x_t$ at time $t$. $\pi$ can be stochastic ($\pi(u_t|x_t)$) or deterministic ($u_t=\pi(x_t)$).
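
To make the distinction concrete, here's a minimal sketch of a deterministic and a stochastic (linear-Gaussian) policy for a toy system. The gain matrix K, offset k, and noise scale are made-up illustration parameters, not values from any paper.

import numpy as np

# Toy linear policies; K, k, and sigma are hypothetical placeholders.
state_dim, ctrl_dim = 4, 2
K = np.zeros((ctrl_dim, state_dim))   # feedback gains
k = np.zeros(ctrl_dim)                # feedforward offset
sigma = 0.1                           # exploration noise for the stochastic case

def deterministic_policy(x):
    # u_t = pi(x_t): the same state always yields the same action
    return K @ x + k

def stochastic_policy(x, rng=np.random.default_rng()):
    # u_t ~ pi(u_t | x_t): a Gaussian centered on the deterministic action
    return rng.normal(loc=K @ x + k, scale=sigma)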

The motor commands $u$ alter the state of the world (including the robot), so the state at the next discrete time step $t+1$ is thought of as being drawn from a probability distribution $p(x_{t+1}|x_t,u_t)$.

The joint states and actions of the robot across time form a trajectory $\tau=(x_1,u_1,...,x_T,u_T)$. For a stochastic policy, the trajectory can be formulated as a joint distribution over $(x_1,u_1,...,x_T,u_T)$.
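
Written out, the Markov assumptions above make the trajectory distribution factor as follows (this is just the standard factorization, in the same notation):

$$p(\tau) = p(x_1)\prod_{t=1}^{T}\pi(u_t\,|\,x_t)\prod_{t=1}^{T-1}p(x_{t+1}\,|\,x_t,u_t)$$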

Traditional Reinforcement Learning (which seems to be the dominant framework for thinking about robotic control) computes the expected "final reward" of a policy for each state $x$ and time $t$ via a value function $V_t^\pi(x)$. For some candidate action $u_t$, we can compute what state $x_{t+1}$ it leads to, and thus determine the reward we obtain if we perform $u_t$. Value function approximation suffers from some challenges that are discussed in Deisenroth et al.
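
For reference, the generic finite-horizon definition of the value function (expected sum of per-step rewards $r$ while following $\pi$; this is the textbook form, not a formula lifted from the survey) is:

$$V_t^\pi(x) = \mathbb{E}_\pi\left[\sum_{k=t}^{T} r(x_k,u_k)\,\Big|\,x_t=x\right]$$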

In contrast to value-based methods, Policy Search (PS) methods use parameterized policies $\pi_\theta$ and attempt to learn the parameters $\theta$. PS is generally preferred to value approximation when it comes to robotics.
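
As a toy illustration of what "learning $\theta$" can look like without any gradients from the dynamics, here's a finite-difference update on the policy parameters. The helper rollout_return(theta) is assumed to run one episode under $\pi_\theta$ and return the summed reward; it's a placeholder, not a real API.

import numpy as np

def finite_difference_step(theta, rollout_return, eps=1e-2, lr=1e-3):
    # Estimate d(return)/d(theta) by perturbing one parameter at a time.
    base = rollout_return(theta)
    grad = np.zeros_like(theta)
    for idx in np.ndindex(theta.shape):
        perturbed = theta.copy()
        perturbed[idx] += eps
        grad[idx] = (rollout_return(perturbed) - base) / eps
    return theta + lr * grad   # ascend the estimated return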

One class of Policy Search methods is Trajectory Optimization, which defines some cost function $J$ over trajectories $\tau$ and attempts to minimize $J$ with $\tau$ as a free parameter.
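
A common concrete choice for $J$ is an additive, quadratic penalty around a goal state (this particular form is the textbook version, not necessarily what any specific paper uses):

$$J(x_{1:T},u_{1:T}) = \sum_{t=1}^{T}\left[(x_t - x_{\text{goal}})^\top Q\,(x_t - x_{\text{goal}}) + u_t^\top R\,u_t\right]$$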

Optimal Control seems to encompass Trajectory Optimization and stuff like "Linear Gaussian Controllers" and related mathematical theory. I haven't completely pinpointed its semantic meaning in relation to the other vocab words.

Paper Review: Embed-to-Control (E2C)

This paper is better understood if the reader first reads Auto-Encoding Variational Bayes, which is how the $z$'s are sampled in this paper. Alternatively, my blog post on implementing DRAW has a gentle introduction to this.

Under a trajectory optimization (optimal control?) framework, we wish to solve the following optimization problem at time $t$:

$\DeclareMathOperator*{\argmin}{arg\,min}$
$$(x^*_{t:t+T}, u^*_{t:t+T}) = \argmin_{x,u}J(x_{t:t+T},u_{t:t+T})$$

That is, we want to determine the states $x_{t:t+T}$ and robot controls $u_{t:t+T}$ that minimize some penalty for the next $T$ discretized time steps.

How do we sample this trajectory space? $u_1$ and $x_1$ determine $x_2$ via the complex nonlinear dynamics between the robot and its environment. We cannot compute $x_2$ short of supplying $u_1$ and $x_1$ to a physics simulator: we could choose an initial condition $x_0$ and run our robot forwards in time for $T$ steps, then evaluate $J$. If we want to try a different trajectory, we run another simulation. But resetting the state back to $x_0$ is certainly not an option if our robot is running in the real world!
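
Here's roughly what that "run forward and score" loop looks like in code. The env.reset()/env.step() interface and the quadratic cost are assumed placeholders for whatever simulator I end up using; the point is that reset() is only cheap in simulation.

import numpy as np

def trajectory_cost(env, policy, x_goal, T=100, Q=1.0, R=0.01):
    # Roll the (simulated) system forward for T steps under `policy`
    # and accumulate a quadratic cost J.
    x = env.reset()                      # impossible to "rewind" a real robot
    J = 0.0
    for t in range(T):
        u = policy(x)
        x = env.step(u)
        J += Q * np.sum((x - x_goal) ** 2) + R * np.sum(u ** 2)
    return J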

The basic idea of E2C is to endow the robot with the ability to "predict what will happen to a system" for a given trajectory. Basically, the complex nonlinear dynamics & physics governing a system are approximated via a neural network. We can hallucinate as many $\tau$'s as we want, which allows us to measure $J$, which in turn gives us the ability to run stochastic optimal control (SOC) algorithms on it.

The idea of applying "future prediction" to SOC algorithms is not new, but the main contribution of this paper is as follows: rather than inferring a future observation $x_{t+1}$ from the current system observation $x_t$ (which might be high-dimensional), we encode $x_t$ into a low-dimensional latent state $z_t$, and then run our "prediction function" on $z_t$ to obtain the next predicted latent state $z_{t+1}$. This is then decoded back into a future observation $x_{t+1}$.
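
Structurally, the pipeline looks something like the sketch below. This is only my shape-level illustration of encode / latent transition / decode; the actual E2C model is a variational autoencoder with learned, locally-linear transition matrices, not the fixed random matrices used here.

import numpy as np

rng = np.random.default_rng(0)
obs_dim, latent_dim, ctrl_dim = 1024, 3, 2             # assumed sizes
W_enc = rng.normal(scale=0.01, size=(latent_dim, obs_dim))
W_dec = rng.normal(scale=0.01, size=(obs_dim, latent_dim))
A = np.eye(latent_dim)                                 # latent transition matrix
B = rng.normal(scale=0.1, size=(latent_dim, ctrl_dim)) # control matrix

def encode(x):        return np.tanh(W_enc @ x)        # x_t -> z_t
def transition(z, u): return A @ z + B @ u             # z_t, u_t -> z_{t+1}
def decode(z):        return W_dec @ z                 # z_{t+1} -> predicted x_{t+1}

x_t = rng.normal(size=obs_dim)        # stand-in for a high-dimensional observation
u_t = rng.normal(size=ctrl_dim)
x_next_pred = decode(transition(encode(x_t), u_t))     # hallucinated next observation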

The key insight here is that state prediction in low-dimensional latent space $Z$ is easier to learn than prediction in a high-dimensional observation space $X$.

Here's a diagram I drew that clarifies the latent encoder mechanism (this portion is only responsible for hallucinating future states, which are then supplied as input to a SOC algorithm).




I hypothesize that the latent space $Z$ learned by the E2C model has some really interesting topologies that might give insight into the properties of the robot (is it moving in a low or high dimensional space? Is the trajectory distribution of the policy multimodal?)

Paper Review: Muscle-Based Locomotion

This is an animation paper, and does not require much background knowledge to understand. It basically uses a genetic algorithm to perform end-to-end optimization of muscle-routing and muscle activation parameters (strength, delay, etc.).

Torques at each joint are determined via Jacobian transpose control, then the joint torques are converted into muscle forces.
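
For reference, Jacobian transpose control maps a desired end-effector (task-space) force $f$ into joint torques $\tau$ via the kinematic Jacobian $J(q)$ (not to be confused with the cost $J$ above):

$$\tau = J(q)^\top f$$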

The reason genetic algorithms are used in this paper is that a physics simulation is needed to measure the cost of a control policy (so the cost function is not trivially differentiable).

This paper works because of the inherent dynamical stability of muscle-based actuation during biped walking. Biped walking falls naturally out of the optimization process if optimized for trunk orientation, head position, and matching some intermediate poses. However, it's not obvious how to generalize muscle-based control to other tasks by exploiting bio-mechanical feedback rules.

I see Deep Learning as especially promising for robotic control in domains where the robot dynamics model is analytically intractable (unlike a rigid joint-based arm). Rather than trying to come up with a Jacobian-based control scheme for muscle, why don't we just train it on the target task, and let it figure out its own control scheme that compensates for its intrinsic biomechanics?

Furthermore, using E2C might reveal something interesting about the latent space of muscle-dynamics (do they live on the same manifold as joint-based locomotion? Most likely not...)

Questions and Ideas


$x_{\text{goal}}$ is a partial or complete observation of the system state that the robot is trying to realize (in the E2C paper, the examples suggest that it is a full 2D image of the robot sitting in its environment).

However, if the environment involves moving pieces (e.g. a tub of water sloshing around, falling obstacles) that are irrelevant to the goal state, our $x_{\text{goal}}$ is unlikely to contain a perfect depiction of the final state. This means that we need to encode $x$ in such a way that these irrelevant details are ignored in $z_{\text{goal}}$.

Implementation Plan and Development Progress:


I'm starting this project fairly late into the semester, so I'll be working really hard on this in order to meet the ambitious goals of this project before I graduate in May.

For the 2D simulator, I successfully installed LiquidFun, which is a 2D rigid-body and fluid simulator written in C++. Part of the reason I chose LiquidFun over Box2D is that Box2D has issues compiling on my laptop, and LiquidFun provides a superset of Box2D features anyway.

My goal by next week is to write a simple worm simulation and a manual controller to prove that locomotion is possible within the simulation framework I am using.

Note to self: turns out that linking the static library requires me to put -lliquidfun AFTER the -o main main.cpp invocation.

g++ -I/contrib/projects/liquidfun/liquidfun/Box2D -L/contrib/projects/liquidfun/liquidfun/Box2D/Box2D/Release -o main main.cpp -lliquidfun

Someone has already implemented E2C in Torch7, but that's a language I'm not proficient in writing or debugging, so I will translate it to Python + TensorFlow.

In other words: "I like wood better than steel, so I'll build this bridge out of wood".


Additional Reading I'm Looking Into:

