I am training a group of 4 agents on an ‘encirclement’ problem, in which the agents must circle a target at a given radius and angular velocity while holding a formation defined by the angular difference between adjacent neighbors.
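To make the geometry concrete, here is a simplified sketch of the error terms involved (the function and parameter names are illustrative, not my actual code; `desired_sep` would be π/2 for four equally spaced agents):

import numpy as np

def encirclement_errors(agent_pos, agent_vel, target_pos,
                        target_radius, target_omega, desired_sep):
    """Per-agent geometry errors for the encirclement task (illustrative).

    agent_pos:  (4, 2) array of agent XY positions
    agent_vel:  (4, 2) array of agent XY velocities
    target_pos: (2,) XY position of the encircled target
    desired_sep: desired angular gap between adjacent agents, in radians
    """
    rel = agent_pos - target_pos                  # vectors from target to agents
    radii = np.linalg.norm(rel, axis=1)           # current orbit radius of each agent
    angles = np.arctan2(rel[:, 1], rel[:, 0])     # polar angle of each agent

    radius_err = np.abs(radii - target_radius)    # deviation from the desired circle

    # Angular velocity about the target: omega = (x*vy - y*vx) / r^2
    omega = (rel[:, 0] * agent_vel[:, 1] - rel[:, 1] * agent_vel[:, 0]) / radii**2
    omega_err = np.abs(omega - target_omega)

    # Formation error: angular gap to the next neighbor vs. the desired gap
    order = np.argsort(angles)                    # neighbors in angular order
    gaps = np.diff(angles[order], append=angles[order][0] + 2 * np.pi)
    formation_err = np.abs(gaps - desired_sep)

    return radius_err, omega_err, formation_err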
I have an individual reward function that I’m fairly happy with, and so far it produces reasonably good results. On top of that I add three group-level signals: a small negative ‘hurry up’ reward on every update, a -1 group reward when any agent goes out of bounds (which also ends the episode), and a +1 group reward when all of the agents are in good formation and are circling the target sufficiently.
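In pseudocode, the group reward for a single step looks roughly like this (a minimal sketch: the boolean inputs stand in for my actual bounds/formation checks, and the -0.001 hurry-up magnitude is a placeholder, not my real value):

def group_reward_step(any_out_of_bounds: bool, all_in_formation: bool,
                      circling_ok: bool, hurry_up: float = -0.001):
    """Group reward for one environment step, as described above."""
    reward = hurry_up                  # small 'hurry up' penalty every update
    done = False
    if any_out_of_bounds:
        reward += -1.0                 # end the episode when anyone leaves bounds
        done = True
    elif all_in_formation and circling_ok:
        reward += +1.0                 # formation held while circling the target
    return reward, done

The individual reward function is computed separately per agent and is not shown here.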
However, the training curves show the opposite of what I expect: the cumulative reward for individuals is clearly being maximized, while the cumulative group reward steadily decreases, almost as though minimizing it were the intention. For this training run I used the following config:
trainer_type: poca
hyperparameters:
  batch_size: 1024
  buffer_size: 100000
  learning_rate: 0.0001
  beta: 0.01
  epsilon: 0.2
  lambd: 0.95
  num_epoch: 3
  learning_rate_schedule: constant
network_settings:
  normalize: false
  hidden_units: 64
  num_layers: 3
  vis_encode_type: simple
  memory: None
reward_signals:
  extrinsic:
    gamma: 0.99
    strength: 1.0
keep_checkpoints: 5
max_steps: 2000000000
time_horizon: 64
summary_freq: 50000
threaded: true
Is there any reason why the group cumulative reward appears to be minimized? Is this a flaw in my environment design, or something about the algorithm I’m using?