Learning State Abstractions for Long-Horizon Planning
Many tasks that we do on a regular basis, such as navigating a city, cooking a
meal, or loading a dishwasher, require planning over extended periods of time.
Accomplishing these tasks may seem simple to us; however, reasoning over long
time horizons remains a major challenge for today’s Reinforcement Learning (RL)
algorithms. While unable to plan over long horizons, deep RL algorithms excel
at learning policies for short horizon tasks, such as robotic grasping,
directly from pixels. At the same time, classical planning methods such as
Dijkstra’s algorithm and A$^*$ search can plan over long time horizons, but
they require hand-specified or task-specific abstract representations of the
environment as input.
To achieve the best of both worlds, state-of-the-art visual navigation methods
have applied classical search methods to learned graphs. In particular, SPTM
and SoRB use a replay buffer of observations as nodes in a graph and learn
a parametric distance function to draw edges in the graph. These methods have
been successfully applied to long-horizon simulated navigation tasks that were
too challenging for previous methods to solve.
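To make the graph construction concrete, here is a minimal sketch of the general recipe: observations become nodes, a distance function decides which edges to draw, and Dijkstra's algorithm searches the result. The names `build_graph`, `shortest_path`, `toy_distance`, and `max_edge_dist` are hypothetical, and plain Euclidean distance stands in for the learned parametric distance function; this is not the SPTM or SoRB implementation.

```python
import heapq
import math


def build_graph(observations, distance_fn, max_edge_dist):
    """Draw a directed edge i -> j whenever the estimated distance is small."""
    graph = {i: {} for i in range(len(observations))}
    for i, a in enumerate(observations):
        for j, b in enumerate(observations):
            if i != j:
                d = distance_fn(a, b)
                if d <= max_edge_dist:
                    graph[i][j] = d
    return graph


def shortest_path(graph, start, goal):
    """Dijkstra's algorithm; returns a list of node indices, or None."""
    best = {start: 0.0}
    prev = {}
    frontier = [(0.0, start)]
    while frontier:
        d, u = heapq.heappop(frontier)
        if u == goal:
            # Walk the predecessor links back to the start.
            path = [u]
            while u in prev:
                u = prev[u]
                path.append(u)
            return path[::-1]
        if d > best.get(u, math.inf):
            continue  # stale queue entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < best.get(v, math.inf):
                best[v] = nd
                prev[v] = u
                heapq.heappush(frontier, (nd, v))
    return None


def toy_distance(a, b):
    """Assumption: Euclidean distance stands in for a learned distance."""
    return math.dist(a, b)


observations = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (2.0, 1.0)]
graph = build_graph(observations, toy_distance, max_edge_dist=1.0)
path = shortest_path(graph, 0, 3)  # waypoints for a low-level policy to follow
```

In a real system, each consecutive pair of nodes on `path` would be handed to a goal-conditioned policy as a short-horizon subgoal.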
Nevertheless, these methods are still limited because they are highly sensitive
to errors in the learned graph. Even a single faulty edge acts like a wormhole
in the graph topology that planning algorithms try to exploit, which makes
existing methods that combine graph search and RL extremely brittle. For
example, if an artificial agent navigating a maze thinks that two observations
on either side of a wall are nearby, its plans will involve transitions that
collide into the wall. Adopting a simple model that assumes a constant
probability $p$ of each edge being faulty, we see that the expected number of
faulty edges is $p|E| = O(|V|^2)$. In other words, errors in the graph scale
quadratically with the number of nodes in the graph.
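The quadratic growth is easy to check numerically. The sketch below assumes a dense directed graph, so that $|E| = |V|(|V| - 1)$, and the constant per-edge error probability $p$ from the simple model above; the function name `expected_faulty_edges` is ours.

```python
def expected_faulty_edges(num_nodes, p):
    """Expected faulty edges, assuming a dense directed graph where each
    edge is independently faulty with probability p."""
    num_edges = num_nodes * (num_nodes - 1)
    return p * num_edges


# Growing the graph 10x grows the expected number of errors roughly 100x.
small = expected_faulty_edges(100, p=0.01)
large = expected_faulty_edges(1000, p=0.01)
ratio = large / small
```

Under these assumptions, shrinking the node set is the most direct lever for reducing errors, which is the motivation for sparsity discussed next.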
We could do a lot better if we could minimize the errors in the graph. But how?
Graphs over observations in both simulated and real-world environments can be
prohibitively large, making it challenging to even identify which edges are
faulty. To minimize errors in the graph, then, we desire sparsity; we want to
keep a minimal set of nodes that is sufficient for planning. If we have a way
to aggregate similar observations into a single node in the graph, we can
reduce the number of errors and improve the accuracy of our plans. The key
challenge is to aggregate observations in a way that respects temporal
constraints. If observations are similar in appearance but actually far apart,
then they should be aggregated into different nodes.
So how can we sparsify our graph while guaranteeing that the graph remains
useful for planning? Our key insight is a novel merging criterion called
two-way consistency. Two-way consistency can be viewed as a generalization of
value irrelevance to the goal-conditioned setting. Intuitively, two-way consistency
merges nodes (i) that can be interchanged as starting states and (ii) that can be
interchanged as goal states.
For an example of two-way consistency, consider the above figure. Suppose
during our node merging procedure we ask: can we merge the nodes with pink and
orange bottles according to two-way consistency? First, we note that moving
from the blue bottle to the pink bottle requires roughly the same work as
moving from the blue bottle to the orange bottle. So the nodes with pink and
orange bottles satisfy criterion (ii) because they can be interchanged as goal
states. However, while it is possible to start from the pink bottle and move to
the blue bottle, if we instead start at the orange bottle, the orange bottle
will fall to the floor and crash! So the nodes with pink and orange bottles
fail criterion (i) because they cannot be interchanged as starting states.
In practice, we can’t expect to encounter two nodes that can be perfectly
interchanged. Instead, we merge nodes that can be interchanged up to a
threshold parameter $\tau$. By increasing $\tau$, we can make the resulting
graph as sparse as we’d like. Crucially, *we prove in the paper that merging
according to two-way consistency preserves the graph’s quality up to an error
term that scales only linearly with the merging threshold $\tau$.*
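As a rough illustration of the criterion, the sketch below checks two-way consistency on a matrix of pairwise distances and then greedily keeps only the nodes that cannot be merged into one already kept. It assumes `dist[a, b]` holds the estimated cost of going from `a` to `b`, and that "interchangeable up to $\tau$" means the worst-case difference in distances is at most $\tau$. The helper names `two_way_consistent` and `sparsify` and the greedy loop are our own simplification, not necessarily the exact procedure in the paper.

```python
import numpy as np


def two_way_consistent(dist, i, j, tau):
    """True if nodes i and j are interchangeable up to tau, both ways."""
    # (i) Interchangeable as starting states: outgoing distances agree.
    out_gap = np.max(np.abs(dist[i, :] - dist[j, :]))
    # (ii) Interchangeable as goal states: incoming distances agree.
    in_gap = np.max(np.abs(dist[:, i] - dist[:, j]))
    return bool(out_gap <= tau and in_gap <= tau)


def sparsify(dist, tau):
    """Greedily keep a node only if no kept node can stand in for it."""
    kept = []
    for candidate in range(dist.shape[0]):
        if not any(two_way_consistent(dist, candidate, k, tau) for k in kept):
            kept.append(candidate)
    return kept


# Toy data: four points on a line, with distance |x_a - x_b|.
x = np.array([0.0, 0.1, 5.0, 5.05])
dist = np.abs(x[:, None] - x[None, :])

kept_small_tau = sparsify(dist, tau=0.5)  # merges only the nearby pairs
kept_large_tau = sparsify(dist, tau=6.0)  # a larger tau gives a sparser graph

# Make leaving node 1 toward the far nodes expensive: nodes 0 and 1 still
# look alike as goals, but are no longer interchangeable as starting states.
one_way = dist.copy()
one_way[1, 2:] += 10.0
still_mergeable = two_way_consistent(one_way, 0, 1, tau=0.5)
```

The last case mirrors the bottle example: agreement on incoming distances alone is not enough to merge two nodes.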
Our motivation for sparsity, discussed above, is robustness: we expect smaller
graphs to have fewer errors. Furthermore, our main theorem tells us that we can
merge nodes according to two-way consistency while preserving the graph’s
quality. Experimentally, though, are the resulting sparse graphs more robust?
To test the robustness of Sparse Graphical Memory to errors in learned distance
metrics, we thinned the walls in the PointEnv mazes. While PointEnv is a
simple environment with $(x, y)$ coordinate observations, thinning the walls is
a major challenge for parametric distance functions; any error in the learned
distance function will cause faulty edges across the walls that destroy the
feasibility of plans. For this reason, simply thinning the maze walls is enough
to break the previous state-of-the-art, resulting in a 0% success rate.
How does Sparse Graphical Memory fare? With many fewer edges, it becomes
tractable to perform self-supervised cleanup: the agent can step through the
environment to detect and remove faulty edges from its graph. The figure below
illustrates the results of this process. While the dense graph shown in red has
many faulty edges, sparsity and self-supervised cleanup, shown in green,
overcome errors in the learned distance metric, leading to a 100% success rate.
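The cleanup step can be sketched as a single pass over the edges. Here `try_traverse` is a hypothetical callback that would, in a real system, run the low-level policy in the environment and report whether it reached the target node; in the toy example it is replaced by a geometric check against an assumed wall at $x = 0.5$.

```python
def cleanup_graph(graph, try_traverse):
    """Remove every edge the agent fails to traverse.

    graph: dict mapping node -> {neighbor: edge weight}; modified in place.
    try_traverse: hypothetical callback (u, v) -> bool.
    """
    removed = []
    for u in list(graph):
        for v in list(graph[u]):
            if not try_traverse(u, v):
                del graph[u][v]
                removed.append((u, v))
    return removed


# Toy maze: a wall at x = 0.5 separates node 2 from nodes 0 and 1.
positions = {0: (0.0, 0.0), 1: (0.4, 0.0), 2: (0.6, 0.0)}
WALL_X = 0.5


def toy_traverse(u, v):
    """Stand-in for a rollout: succeed only if u and v share a side."""
    return (positions[u][0] - WALL_X) * (positions[v][0] - WALL_X) > 0


# The distance function wrongly linked nodes 1 and 2 through the wall.
graph = {0: {1: 0.4}, 1: {0: 0.4, 2: 0.2}, 2: {1: 0.2}}
removed = cleanup_graph(graph, toy_traverse)
```

Each check costs one attempted traversal, which is why this kind of pass is only practical once the graph is sparse.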
We see a similar trend in experiments with visual input. In both ViZDoom
and SafetyGym, maze navigation tasks that require planning from raw
images, Sparse Graphical Memory consistently improves the success of baseline
methods including SoRB and SPTM.
In addition to containing fewer errors, Sparse Graphical Memory also results in
more optimal plans. On a ViZDoom maze navigation task, we find that SGM
requires significantly fewer steps to reach the final goal across easy, medium,
and hard maze tasks, meaning that the agent follows a shorter path to the final
goal.
Overall, we found that state aggregation with two-way consistency resulted in
substantially more robust plans than the prior state of the art. While
promising, many open questions and challenges remain for combining classical
planning with learning-based control. Some of the questions we’re thinking
about: How can we extend these methods beyond navigation to manipulation
domains? As the world is not static, how should we build graphs over changing
environments? How can two-way consistency be utilized beyond the scope of
graph-based planning methods? We are excited about these future directions
and hope our theoretical and experimental findings prove useful to other
researchers investigating control over extended time horizons.
- Emmons*, Jain*, Laskin* et al. Sparse Graphical Memory for Robust Planning. NeurIPS 2020.
- Savinov et al. Semi-parametric Topological Memory for Navigation. ICLR 2018.
- Eysenbach et al. Search on the Replay Buffer: Bridging Planning and Reinforcement Learning. NeurIPS 2019.
- Wydmuch et al. ViZDoom Competitions: Playing Doom from Pixels. IEEE Transactions on Games, 2018.
- Ray et al. Benchmarking Safe Exploration in Deep Reinforcement Learning. Preprint, 2019.