NerveNet: Learning Structured Policy with Graph Neural Networks

Tingwu Wang*
Renjie Liao*
Jimmy Ba
Sanja Fidler
University of Toronto   &   Vector Institute


We address the problem of learning structured policies for continuous control. In traditional reinforcement learning, an agent's policy is an MLP that takes the concatenation of all observations from the environment as input and predicts actions. In this work, we propose NerveNet to explicitly model the structure of an agent, which naturally takes the form of a graph. Specifically, serving as the agent's policy network, NerveNet first propagates information over the structure of the agent and then predicts actions for different parts of the agent. In our experiments, we first show that NerveNet is comparable to state-of-the-art methods on standard MuJoCo environments. We further propose customized reinforcement learning environments for benchmarking two types of structure transfer learning tasks, i.e., size transfer and disability transfer. We demonstrate that policies learned by NerveNet are significantly better than policies learned by other models and transfer even in a zero-shot setting.
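To make the propagate-then-predict idea concrete, here is a minimal PyTorch sketch of a graph-structured policy. It is a simplified illustration, not the released implementation: the class and parameter names (`GraphPolicy`, `prop_steps`, the toy `edges` list, the sizes in the usage example) are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class GraphPolicy(nn.Module):
    """Minimal GNN-policy sketch: encode per-node observations, propagate
    messages over the agent's body graph for a few steps, then predict one
    action per node with a shared output head."""

    def __init__(self, obs_dim, hidden_dim, edges, num_nodes, prop_steps=3):
        super().__init__()
        self.edges = edges                     # list of (src, dst) node index pairs
        self.num_nodes = num_nodes
        self.prop_steps = prop_steps
        self.encode = nn.Linear(obs_dim, hidden_dim)
        self.message = nn.Linear(hidden_dim, hidden_dim)
        self.update = nn.GRUCell(hidden_dim, hidden_dim)
        self.action_head = nn.Linear(hidden_dim, 1)   # Gaussian mean per joint

    def forward(self, node_obs):               # node_obs: [num_nodes, obs_dim]
        h = torch.tanh(self.encode(node_obs))  # initial node states
        for _ in range(self.prop_steps):
            # sum incoming messages from neighbours along the body graph
            incoming = [torch.zeros(h.size(1)) for _ in range(self.num_nodes)]
            for src, dst in self.edges:
                incoming[dst] = incoming[dst] + self.message(h[src])
            h = self.update(torch.stack(incoming), h)   # GRU-style node update
        return self.action_head(h).squeeze(-1)          # one action mean per node


# Usage on a toy 3-node chain (all sizes here are illustrative):
edges = [(0, 1), (1, 0), (1, 2), (2, 1)]
policy = GraphPolicy(obs_dim=6, hidden_dim=32, edges=edges, num_nodes=3)
actions = policy(torch.randn(3, 6))            # -> tensor of 3 action means
```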


Source Code and Demos



Performance on MuJoCo Benchmarks


We compare NerveNet with standard MLP models and TreeNet. From the figures, we can see that NerveNet matches the performance of MLP both in sample efficiency and in performance after convergence. In most cases, TreeNet is worse than NerveNet, which highlights the importance of preserving the physical graph structure.

Zero-shot Performance



We examine the zero-shot performance without any fine-tuning. As we can see from the figures, NerveNet outperforms all competitors in almost all settings. Including the running-length results shows that NerveNet is the only model able to walk in the zero-shot evaluations of the centipedes. In fact, NerveNet's performance is often orders of magnitude better, and most of the time agents trained with other methods cannot even move forward. We also notice that when transferring from CentipedeSix, NerveNet provides walkable pretrained models for all new agents.

Finetuning RL Agents



We fine-tune for both the size transfer and disability transfer experiments and show the training curves. From the figures, we can see that by using the pre-trained model, NerveNet significantly decreases the number of episodes required to reach the reward threshold at which the task is considered solved.

Interpretable Features



Moreover, by examining the result videos of the centipedes, we noticed that a "walk-cycle" behavior emerges for NerveNet but is not common for the other models. Walk-cycles are adopted by many insects in nature. For example, six-legged ants use a tripedal gait, in which the legs form two triangles that alternately touch the ground.
We also visualize and interpret the learned representations. We extract the final state vectors of the nodes of NerveNet and apply 1-D and 2-D PCA to these node representations. We notice that each pair of legs learns invariant representations, despite their different positions on the agent. As we can see, the hidden representations learned by our model exhibit a clear periodic behavior. Furthermore, the representations of adjacent left legs and adjacent right legs show a phase shift, which further indicates that our agents learn the walk-cycle without any additional supervision.
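As a rough illustration of this analysis, the sketch below projects the final node state vectors with PCA. The data here is a randomly generated placeholder standing in for states logged from the policy, and the array shapes and leg node indices are hypothetical; plotting the first component of adjacent legs over time is what reveals the phase-shifted, periodic walk-cycle pattern.

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder for the recorded data: node_states[t, n] is the final propagated
# state vector of node n at environment step t, collected over one rollout.
T, N, D = 500, 12, 64                      # timesteps, nodes, hidden size (illustrative)
node_states = np.random.randn(T, N, D)     # replace with states logged from the policy

# Project every node state onto the top two principal components.
proj = PCA(n_components=2).fit_transform(node_states.reshape(T * N, D))
proj = proj.reshape(T, N, 2)

# The first component of two adjacent left legs over time should show the
# phase-shifted periodic pattern (the walk-cycle) described above.
left_leg_a, left_leg_b = 3, 5              # hypothetical node indices of adjacent left legs
curve_a, curve_b = proj[:, left_leg_a, 0], proj[:, left_leg_b, 0]
```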

Model Variants


As we can see from the figures, the NerveNet-MLP and NerveNet-2 variants perform better than NerveNet-1. One potential reason is that sharing the weights of the value and policy networks makes trust-region-based optimization methods, like PPO, more sensitive to the weights of the value function.
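The difference between the variants comes down to how the value function shares parameters with the policy. The sketch below is only a schematic of the three weight-sharing schemes described above; the module names and sizes are illustrative and not from the released code.

```python
import torch.nn as nn

hidden_dim, num_nodes, obs_dim = 64, 8, 12      # illustrative sizes

def make_gnn_core():
    # Stand-in for the graph propagation network (see the sketch near the top).
    return nn.Sequential(nn.Linear(obs_dim, hidden_dim), nn.Tanh())

# NerveNet-1: one shared core feeds both the policy and the value head, so the
# value loss in PPO also moves the policy features.
shared_core = make_gnn_core()
policy_head = nn.Linear(hidden_dim, 1)   # per-node action mean
value_head = nn.Linear(hidden_dim, 1)    # state value, read from pooled node states

# NerveNet-2: two separately parameterized cores, one for each head.
policy_core, value_core = make_gnn_core(), make_gnn_core()

# NerveNet-MLP: graph network for the policy, plain MLP for the value function.
value_mlp = nn.Sequential(nn.Linear(obs_dim * num_nodes, hidden_dim),
                          nn.Tanh(), nn.Linear(hidden_dim, 1))
```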

Paper

[Paper 3.3MB] 

Citation
 
Tingwu Wang, Renjie Liao, Jimmy Ba and Sanja Fidler.

NerveNet: Learning Structured Policy with Graph Neural Networks.




Last Update: Nov 16th, 2017
Web Page Template: this