Cooperative Policy Learning with Pre-trained Heterogeneous Observation Representations

Wenlei Shi1          Xinran Wei2          Jia Zhang1          Xiaoyuan Ni3          Arthur Jiang4          Jiang Bian1          Tie-Yan Liu1

1Microsoft Research 2Beijing University of Posts and Telecommunications 3The Hong Kong University of Science and Technology 4Microsoft ARD Incubation Team


Multi-agent reinforcement learning (MARL) has been increasingly explored to learn cooperative policies that maximize a certain global reward. Many existing studies take advantage of graph neural networks (GNN) in MARL to propagate critical collaborative information over the interaction graph built upon inter-connected agents. Nevertheless, the vanilla GNN approach has substantial shortcomings in complex real-world scenarios: the generic message passing mechanism is ineffective between heterogeneous vertices, and simple message aggregation functions cannot accurately model the combinational interactions from multiple neighbors. While adopting complex GNN models with more informative message passing and aggregation mechanisms can benefit heterogeneous vertex representations and cooperative policy learning, it can also increase the training difficulty of MARL and demand more intense and direct reward signals than the original global reward. To address these challenges, we propose a new cooperative learning framework with pre-trained heterogeneous observation representations. In particular, we employ an encoder-decoder based graph attention (EncGAT) to learn the intricate interactions and heterogeneous representations in a form that MARL can leverage more easily. Moreover, we design a pre-training with local actor-critic (PreLAC) algorithm to ease the difficulty of cooperative policy learning. Extensive experiments over real-world scenarios demonstrate that our new approach can significantly outperform existing MARL baselines as well as operations research solutions that are widely used in industry.


The overall structure of EncGAT-PreLAC. From left to right, it uses EncGAT for interaction representation and feeds the representations to the actor and critic heads. The overall model is first pre-trained with the local loss 𝐿_loc and then fine-tuned with a global loss 𝐿.

The EncGAT model is designed to efficiently learn informative representations from the complex interaction graph. The encoder is based on self-attention, similar to a Transformer block. Aggregation among neighbors of the same type is then performed by the decoder attention, which applies another scaled dot-product attention that uses the vertex's feature vector as the query and the matrix of encoded neighbor features as the keys and values.
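The encoder-decoder attention described above can be sketched in a few lines of plain Python. This is only an illustrative sketch under simplifying assumptions: it omits the learned query/key/value projections, multiple heads, and feed-forward layers that a real model would have, and the function names (`attention`, `encoder_decoder_attention`) are hypothetical rather than taken from the released code.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(q.k / sqrt(d)) applied to values."""
    d = len(keys[0])
    out = []
    for q in queries:
        weights = softmax([dot(q, k) / math.sqrt(d) for k in keys])
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def encoder_decoder_attention(vertex, neighbors):
    """Aggregate same-type neighbors into one vector for `vertex`.

    Assumption: vertex and neighbor features share one dimension and no
    learned projections are applied (a real model would learn them).
    """
    # Encoder: self-attention among the neighbors of one type.
    encoded = attention(neighbors, neighbors, neighbors)
    # Decoder: the vertex feature is the query; the encoded neighbors
    # serve as both keys and values.
    return attention([vertex], encoded, encoded)[0]

# Hypothetical example: one port with three neighboring vessels.
port = [1.0, 0.0]
vessels = [[0.0, 1.0], [2.0, 3.0], [4.0, -1.0]]
aggregated = encoder_decoder_attention(port, vessels)
```

Because both attention steps produce convex combinations, the aggregated vector always lies within the range spanned by the neighbor features; the learned projections omitted here are what would let the model weight neighbors by task relevance.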

The network structure has three characteristics motivated by the problem context of ECR (Empty Container Repositioning). First, it uses temporal attention to capture features of the sequential input. Second, the edge features are concatenated with the neighbor features before being fed to EncGAT. Third, both the port and the vessel features are fed into the action head. In addition, we stack two encoder-decoder attention layers in EncGAT, with residual connections to prevent over-smoothing. The actor and critic heads each consist of two fully connected layers, also with residual connections. The actor and local critic heads are shared between agents, which makes the overall framework inductive.
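Two of the structural choices above, concatenating edge features onto neighbor features and stacking layers with residual connections, can be illustrated with a minimal sketch. The helper names and the toy layers are hypothetical; the actual layers in the model are learned attention blocks and fully connected layers, which are not reproduced here.

```python
def concat_edge_features(neighbor_feats, edge_feats):
    """Append each edge's features to the features of the neighbor it leads to."""
    return [list(n) + list(e) for n, e in zip(neighbor_feats, edge_feats)]

def residual_stack(x, layers):
    """Apply layers in sequence, adding each layer's output back to its input.

    The skip connection keeps the original vertex signal in the result,
    which is the usual motivation for using it against over-smoothing.
    """
    for layer in layers:
        update = layer(x)
        x = [xi + ui for xi, ui in zip(x, update)]
    return x

# Hypothetical stand-in for a learned layer: scale every feature by 0.5.
def halve(v):
    return [0.5 * vi for vi in v]

neighbors_with_edges = concat_edge_features([[1.0, 2.0]], [[0.5]])
stacked = residual_stack([2.0], [halve, halve])  # two stacked layers
```

In this toy run the input 2.0 becomes 3.0 after the first layer and 4.5 after the second, since each step adds the layer output to the value it received.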

More details of our model structure can be found on the MARO platform. The full paper and oral presentation were accepted by AAMAS 2021.


Baseline Comparison
We compare our framework with existing baseline methods, using the total fulfillment ratio (the ratio of fulfilled demands to all demands) as the evaluation metric.
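As a small illustration of the metric, the sketch below computes the total fulfillment ratio from per-step demand records. The function name, the input layout, and the convention of returning 1.0 when there is no demand at all are assumptions made for this example, not details taken from the paper or the MARO platform.

```python
def total_fulfillment_ratio(fulfilled, demanded):
    """Ratio of fulfilled demand to all demand, summed over every record."""
    total_demand = sum(demanded)
    if total_demand == 0:
        # Assumption: with no demand there is nothing left unfulfilled.
        return 1.0
    return sum(fulfilled) / total_demand

# Hypothetical records: 8 of 10 and 5 of 10 containers were supplied.
ratio = total_fulfillment_ratio([8, 5], [10, 10])
```

Summing before dividing weights every unit of demand equally, so a busy port influences the metric more than a quiet one; averaging per-port ratios would be a different metric.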

Ablation Study
We also conduct ablation studies to answer two questions:

  1. Does the encoder-decoder attention in EncGAT help heterogeneous information aggregation and intricate interaction understanding?
  2. Does PreLAC improve learning performance of cooperative policy?

Visualization of vertex embeddings in global-critic (a) and our method (b). Colors identify different ports, and each point represents a feature vector in a batch.

Order proportion distribution between two ports in the original topology (a) and the new topology (b). Each column represents the proportions of orders from one port to the other ports. Note that the two merged ports are the 9th and 10th columns.