Here's what I learned from reading the E2C paper a second time, this time thinking more carefully about implementation details. Please excuse typos in this post - I didn't proofread super carefully.
Problem Formulation for E2C Latent-Space Generative Model
The E2C model's job is: "given the previous system state observation $x_t$, predict the future system state observation $x_{t+1}$ if the robot were to apply controls $u_t$." Therefore, each training sample is a tuple of the form $(x_t,u_t,x_{t+1}) \in D$. $x_{t+1}$ is the "label" that we use to supervise learning of $Q_\psi(\hat{Z}|Z,u)$.
If each observation $x_t$ is an image, then the robot doesn't have a way of inferring the velocity of moving objects, which is pretty important in extrapolating what happens in the future.
In the paper, the authors fix this by bundling $x_{t-1}$ with $x_t$ so the robot has the opportunity to extract velocity information. Each training sample tuple is actually formulated as:
$$(x_{t-1},x_t,u_t,x_{t+1})$$
$[x_{t-1},x_t]$ represents some system state that is in motion (some unknown control $u_{t-1}$ was applied to $x_{t-1}$, taking the system to $x_t$). $u_t$ is the "proposed control", and $x_{t+1}$ is the same supervised label as before. Note that we need not encode $u_{t-1}$: because we treat the system dynamics as a Markov process, the future $x_{t+1}$ depends only on $x_t$ and $u_t$.
Ideas:
Instead of concatenating the previous image and letting inference learn the motion relationship, why not just compute a crude (optical) flow, e.g. the frame difference $x_t - x_{t-1}$, and pass that in as a feature? Maybe that makes velocity / collision planning easier to learn?
In the mammalian nervous system, rods and cones do not communicate with the brain in a synchronous manner. Instead, portions of the image stream at time $t$ are perceived with a slight lag, and may not end up at the "prediction" layer until a later time $t+\Delta{t}$. Perhaps the brain can take advantage of this "sampling across the time domain" asynchrony in order to infer optical flow.
We might make E2C do this by making the model recurrent for a short time sequence, like the DRAW paper. E2C would then operate over a sequence of inputs, and might learn to integrate / memorize an early input $x_0$ in order to compute an optical flow at $t=1$ using $x_1$ and $x_0$.
It'd be really cool to combine E2C with attention / temporal integration capabilities.
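The frame-difference idea from the list above can be sketched in a few lines. This is only a toy illustration under my own assumptions: the function name `stack_with_difference` is made up, the difference is taken between the two input frames $x_t$ and $x_{t-1}$, and a plain pixel difference is a crude motion cue rather than true optical flow.

```python
import numpy as np

def stack_with_difference(x_prev, x_curr):
    """Return an array with two channels: [x_t, x_t - x_{t-1}]."""
    x_prev = np.asarray(x_prev, dtype=np.float32)
    x_curr = np.asarray(x_curr, dtype=np.float32)
    diff = x_curr - x_prev  # crude motion cue, NOT true optical flow
    return np.stack([x_curr, diff], axis=0)

# A dot moving one pixel to the right on a 1x5 "image".
x_prev = np.array([[0, 1, 0, 0, 0]])
x_curr = np.array([[0, 0, 1, 0, 0]])
features = stack_with_difference(x_prev, x_curr)

# Difference channel: -1 where the dot left, +1 where it arrived.
print(features[1])
```

The sign pattern in the difference channel encodes the direction of motion, which is the information the encoder would otherwise have to dig out of the two concatenated frames.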
Step 1: constructing training dataset $D$
The training set $D$ is pre-computed (via a physics simulator) and then used to teach E2C how to predict future latent states of the system. Note that training E2C involves no trajectory optimization, physics simulation, or robotic control -- just learning "precognition".
Note that we can reduce the disk size of our training dataset by sampling $(x_{t-1},x_t,u_t,x_{t+1})$ from randomly chosen intervals in continuous trajectories. The $x_{t+1}$ in one datapoint can be re-used as the $x_{t-1}$ or $x_t$ in another sample.
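Here is a minimal sketch of that re-use trick, assuming a trajectory is stored once as an array of frames and samples are just index triples into it. The names `sample_windows`, `frames` and `controls` are hypothetical, the random frames are placeholders for simulator output, and I'm assuming `controls[t]` is the control applied at `frames[t]`.

```python
import numpy as np

def sample_windows(num_frames, num_samples, rng):
    """Pick index triples (t-1, t, t+1) that lie inside one trajectory."""
    # Largest valid start is num_frames - 3, so that start + 2 is in range.
    starts = rng.integers(0, num_frames - 2, size=num_samples)
    return np.stack([starts, starts + 1, starts + 2], axis=1)

rng = np.random.default_rng(0)
T = 100
frames = rng.random((T, 40, 40)).astype(np.float32)  # placeholder images
controls = rng.uniform(-1.0, 1.0, size=(T, 2))       # placeholder controls

idx = sample_windows(T, num_samples=32, rng=rng)
x_prev = frames[idx[:, 0]]   # x_{t-1}
x_curr = frames[idx[:, 1]]   # x_t
x_next = frames[idx[:, 2]]   # x_{t+1}, the supervised label
u = controls[idx[:, 1]]      # u_t, applied at x_t
```

Only `frames` and `controls` need to live on disk; each frame can show up as $x_{t-1}$, $x_t$ or $x_{t+1}$ in different samples without being stored more than once.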
So the dataset is computed by initializing a reasonable starting state, then applying a bunch of random controls to it and letting the robot/system "flop around".
I sent an email to Manuel Watter and he confirmed that this is indeed the case.
I can think of a couple issues though:
If we sample $u \in U$ uniformly, we need to have "good coverage" over the range of all possible controls we would want the robot to be able to predict behavior over. If we want the network to be able to accurately predict the future for an arbitrary control sampled from the full movement space $U$, then the training data necessarily grows exponentially with the degrees of freedom in the robot. In fact, it really grows with all pairwise combinations of state and control: $S \times U$, which is really depressing [1].
There's another large problem, which is the case in which the system is not ergodic. If gravity or entropy are involved in our dynamical system, this is very likely to be the case: the robot could fall down a gorge and any control $u$ in that state will result in the same future state: "stuck". Watter agrees with my hypothesis, but they haven't tried this.
Sampling a bunch of different initial states helps, but there's still the problem that a large share of the data tuples consists of "stuck" predictions. If our dataset is even slightly biased towards being "stuck", then our robot will tend to predict "stuck" states.
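Both issues can be made a bit more concrete with some toy arithmetic. The sketch below rests on my own assumptions: "coverage" means a uniform grid with a fixed number of bins per control dimension, and a tuple counts as "stuck" when the observation does not change at all after the control is applied (which misses states that are doomed but still moving). The helper names are hypothetical.

```python
import numpy as np

def grid_coverage_size(bins_per_dim, num_dims):
    """Number of grid cells needed to cover a num_dims-dimensional space."""
    return bins_per_dim ** num_dims

def stuck_fraction(x_curr, x_next, tol=1e-6):
    """Fraction of tuples whose observation is unchanged after the control."""
    x_curr = np.asarray(x_curr, dtype=np.float64)
    x_next = np.asarray(x_next, dtype=np.float64)
    n = x_curr.shape[0]
    change = np.abs(x_next - x_curr).reshape(n, -1).max(axis=1)
    return float(np.mean(change <= tol))

# 10 bins per control dimension: 2 DOF vs. 7 DOF.
print(grid_coverage_size(10, 2))  # 100
print(grid_coverage_size(10, 7))  # 10000000

# Four toy tuples; the middle two did not move.
x_curr = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 2.0], [3.0, 1.0]])
x_next = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 2.0], [3.0, 2.0]])
print(stuck_fraction(x_curr, x_next))  # 0.5
```

A check like `stuck_fraction` could be run on the dataset before training to see how biased it is, and stuck tuples could then be down-sampled if the fraction looks too high.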
If training is successful (that's a big IF), E2C learns the Markov transition function of the latent space dynamics. Now we can hallucinate trajectories without actually simulating dynamics in the simulator. Given $x_0$, we can predict $z_1,...,z_T$, as well as the decoded images $x_1,x_2,...,x_T$.
Step 2: Training E2C on D
One thing I'm still trying to figure out is how to compute $H_t$, the covariance that represents "system noise". The paper mentions it a couple times but doesn't give a clear definition. Is that a free parameter that we choose?
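To see where $H_t$ enters, here is a small sketch of how I read the paper's locally linear transition: the predicted latent mean is $A\mu + Bu + o$ and the predicted covariance is $A \Sigma A^T + H$. Everything else is my own assumption for illustration: $A$, $B$, $o$ are held constant here (in E2C they are predicted by the network at every step), and $H$ is treated as a hand-picked free parameter, which the paper does not confirm.

```python
import numpy as np

def predict_next(mu, sigma, u, A, B, o, H):
    """One step of a linear-Gaussian latent transition."""
    mu_next = A @ mu + B @ u + o
    sigma_next = A @ sigma @ A.T + H  # H adds "system noise" at every step
    return mu_next, sigma_next

def rollout(mu0, sigma0, controls, A, B, o, H):
    """Hallucinate a latent trajectory without touching a simulator."""
    mu, sigma = mu0, sigma0
    means, covs = [], []
    for u in controls:
        mu, sigma = predict_next(mu, sigma, u, A, B, o, H)
        means.append(mu)
        covs.append(sigma)
    return np.array(means), np.array(covs)

# Toy 2D latent space; all values below are made up.
A, B, o = np.eye(2), np.eye(2), np.zeros(2)
H = 0.01 * np.eye(2)                     # ASSUMPTION: fixed, chosen by hand
mu0, sigma0 = np.zeros(2), 0.1 * np.eye(2)
controls = np.tile([1.0, 0.0], (3, 1))   # push "right" three times

means, covs = rollout(mu0, sigma0, controls, A, B, o, H)
print(means[-1])       # final predicted latent mean
print(covs[-1][0, 0])  # variance has grown by H at each step
```

With this reading, a larger $H$ just makes the predicted latent state fuzzier the further we hallucinate into the future, which is at least consistent with calling it "system noise".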
Step 3: Control
This is where the trained E2C is used to hallucinate state trajectories for use in control. I'm currently working through the AICO paper.
[1] Humans have the ability to predict/navigate dynamical states $s$ and observations $x$ that they have never seen before. I imagine that an adult human's "training data" consists of a pretty comprehensive sampling of the space $U$ (available motions), but a very limited sampling of the space $X$.
If a human were using E2C to perform control and were able to predict the future for a given $x \in X$, then that would imply that although $x$ may be novel, the human has actually seen a very similar (or identical) latent representation $z = m(x)$ before ("same problem, different wording"). The learned transition functions for certain latent variables, like our intuition of "object permanence", "gravity", and "bipedal walking", seem to be extremely well-developed, so although we may not have been in a particular scenario before, we can still predict core pieces of the next observation using these "robust transition primitives".
Replicating this in E2C requires some big advances in unsupervised learning.
Concretely, we need to (1) learn some set of near-universal latent space dynamics that persist no matter what observation we are looking at, and (2) fill in "sparse information" using these "universal latent dynamics" and whatever available information we are partially familiar with.
Then, we somehow combine our sparse observation predictions to construct a dense observation prediction $x_{t+1}$, even though we never observed anything like $x_t$ before. This might solve our problem of exponential search space explosion.
[2] Random pondering: we can predict very high-dimensional systems (stock prices, all of our muscles). Our prediction ability decreases as a function of "information distance" from what we can perceive. Perhaps we not only learn a latent representation of world state, but also a latent representation of world state uncertainty.
Implementation Progress
I've constructed the dataset for the 2D plane navigation task. On the left is the starting state, on the right is the ending state after the robot (grey dot) has randomly moved one pixel to the left or right ($u$ value not shown).
I've also started implementation of the E2C model, and implemented basically everything except the loss nodes (the tricky part).
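For reference, a stripped-down sketch of that kind of dataset generation might look like the following. This is not my actual dataset code: the 40x40 image size, the grey level of 0.5, the absence of obstacles and the helper names `render` and `make_sample` are all assumptions made for illustration.

```python
import numpy as np

def render(row, col, size=40, grey=0.5):
    """Draw the robot as a single grey pixel on a black image."""
    img = np.zeros((size, size), dtype=np.float32)
    img[row, col] = grey
    return img

def make_sample(rng, size=40):
    """Return (x_t, u_t, x_{t+1}) for a one-pixel move left or right."""
    row = int(rng.integers(0, size))
    col = int(rng.integers(1, size - 1))  # margin so the move stays in bounds
    u = int(rng.choice([-1, 1]))          # -1 = left, +1 = right
    return render(row, col, size), u, render(row, col + u, size)

rng = np.random.default_rng(0)
dataset = [make_sample(rng) for _ in range(1000)]

x_t, u_t, x_next = dataset[0]
print(np.argwhere(x_t > 0)[0], u_t, np.argwhere(x_next > 0)[0])
```

Keeping the starting column away from the border sidesteps the question of what happens when the robot is pushed into a wall, which would otherwise produce exactly the "stuck" tuples discussed earlier.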