Individual Reward is Increased but Group Reward is Decreased

I am training a group of 4 agents on an ‘encirclement’ problem, whereby the agents have to circle a target at a given radius and angular velocity while maintaining a certain formation, defined by the angular difference between adjacent neighbors.

I have an individual reward function which I’m pretty happy with, and it seems to generate relatively good results so far. I also add a small ‘hurry up’ group penalty on each update, a -1 group reward when an agent goes out of bounds (which ends the episode), and a +1 group reward when all of the agents are in good formation and are circling the target sufficiently.
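For reference, the group rewards are wired up through a SimpleMultiAgentGroup, roughly like the sketch below (the controller class, the penalty value and the two environment checks are simplified placeholders rather than my exact code):

using Unity.MLAgents;
using UnityEngine;

public class EncirclementEnvController : MonoBehaviour
{
    // All four agents are registered with one group so MA-POCA can credit group rewards.
    private SimpleMultiAgentGroup m_AgentGroup;

    // Small per-update group penalty; illustrative value only.
    [SerializeField] private float hurryUpPenalty = 0.0001f;

    private void Start()
    {
        m_AgentGroup = new SimpleMultiAgentGroup();
        foreach (var agent in FindObjectsOfType<Agent>())
        {
            m_AgentGroup.RegisterAgent(agent);
        }
    }

    private void FixedUpdate()
    {
        // 'Hurry up' group penalty on every update.
        m_AgentGroup.AddGroupReward(-hurryUpPenalty);

        if (AnyAgentOutOfBounds())
        {
            // -1 group reward and the episode ends for everyone.
            m_AgentGroup.AddGroupReward(-1f);
            m_AgentGroup.EndGroupEpisode();
        }
        else if (AllAgentsInFormationAndCircling())
        {
            // +1 group reward when the formation and circling criteria are met.
            m_AgentGroup.AddGroupReward(1f);
            m_AgentGroup.EndGroupEpisode();
        }
    }

    // Environment-specific checks, omitted here.
    private bool AnyAgentOutOfBounds() { return false; }
    private bool AllAgentsInFormationAndCircling() { return false; }
}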

However, these are the results that I get:

It seems clear to me that the individual cumulative reward is being properly maximized, but, almost oppositely, the group cumulative reward is decreasing, as though that were the intention. For this training I used the following config:

trainer_type: poca
hyperparameters:
    batch_size: 1024
    buffer_size: 100000
    learning_rate: 0.0001
    beta: 0.01
    epsilon: 0.2
    lambd: 0.95
    num_epoch: 3
    learning_rate_schedule: constant
network_settings:
    normalize: false
    hidden_units: 64
    num_layers: 3
    vis_encode_type: simple
    memory: None
reward_signals:
    extrinsic:
        gamma: 0.99
        strength: 1.0
keep_checkpoints: 5
max_steps: 2000000000
time_horizon: 64
summary_freq: 50000
threaded: true

Is there any reason why it appears as though the group cumulative reward is being minimized? Is it my poor environment design or something in the algorithm that I am using?

Glad you’re checking out the new multi-agent features. The Group Reward is much, much smaller than the individual reward - so the agents have learned to sacrifice it in favor of maximizing the individual rewards.

I’d try removing the group penalties and/or dramatically decreasing the magnitude of the individual rewards.
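As a rough illustration (not an exact recipe, and the scaling is something you’d have to tune): if the shaped per-step reward is in [0, 1] and an episode lasts a few thousand steps, the cumulative individual reward can be thousands of times larger than the terminal ±1 group reward, so one option is to scale the per-step reward down by the episode length.

// Illustrative sketch only: keep the cumulative individual reward per episode
// roughly within [0, 1], i.e. comparable to the terminal +/-1 group reward.
public static class RewardScaling
{
    public static float ScaleIndividualReward(float perStepReward, int maxEnvironmentSteps)
    {
        return perStepReward / maxEnvironmentSteps;
    }
}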

Thanks for your reply @ervteng_unity! I see what you mean: I am adding an individual reward in the range [0, 1] on each step, but only adding a group reward (+1 in the win scenario and -1 in the lose scenario) on the last step of each episode. How should I distribute these individual and group rewards?
Should the cumulative group reward be of a similar magnitude to the cumulative individual reward?

What is the best balance for getting the agents to simultaneously maximize the individual reward as well as the group reward?

On a side note, this slightly confuses me, because if each agent does its job extremely well then the group should ‘win’.

I have altered how I give rewards, but I am still having problems getting the agent reward and the group reward to converge simultaneously.

I have simplified the problem so that now I just want the agents to move to the radius of an imaginary circle surrounding a target. The radius of this circle is determined beforehand. I use the following reward function to do this:

private float EncirclementRadiusReward(float radiusError)
{
    // Positive reward only once the (normalised) radius error is within tolerance.
    if (radiusError <= 0.01f)
    {
        return (-1f * radiusError) + maxReward;
    }

    return 0f;
}

The radius error is given by the following function:

public float EncirclementRadiusError()
{
    var radius = Vector3.Distance(transform.position, m_targetPos);
    var radiusErr = Mathf.Abs(radius - desiredEncirclementRadius);

    // Normalise the radius error to [0, 1].
    var maxErr = maxDistToTarget - desiredEncirclementRadius;
    const float minErr = 0f;

    return NormaliseFloat(radiusErr, maxErr, minErr);
}
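(For completeness, NormaliseFloat is just a standard min-max normalisation into [0, 1], along the lines of:)

// Min-max normalisation of value into [0, 1].
private float NormaliseFloat(float value, float max, float min)
{
    return Mathf.Clamp01((value - min) / (max - min));
}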

This reward is applied to each agent individually. I also apply a ‘hurry up penalty’ to each agent at each step:

Agent.AddReward(-0.5f / MaxEnvironmentSteps);

In terms of group rewards, I also give the same hurry-up penalty to the entire group, and I add a group reward of +1 on an episode win and -1 on an episode loss (triggered either by an agent moving out of bounds or by the number of steps exceeding the max).

A group win is given if all agents satisfy the following condition:

Agent.EncirclementRadiusError() <= 0.01f

(Notice how an agent only receives a positive individual reward when its radius error is at most 0.01f, and the group only wins when all agents have radius error <= 0.01f.)
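Putting it together, the group-level logic each environment step is roughly the following (a simplified sketch; m_Agents, m_AgentGroup, m_stepCount and the out-of-bounds check are placeholder names):

private void EvaluateGroupReward()
{
    // Group hurry-up penalty, same magnitude as the individual one.
    m_AgentGroup.AddGroupReward(-0.5f / MaxEnvironmentSteps);

    // Check whether every agent is within tolerance of the desired radius.
    bool allInPosition = true;
    foreach (var agent in m_Agents)
    {
        if (agent.EncirclementRadiusError() > 0.01f)
        {
            allInPosition = false;
            break;
        }
    }

    if (allInPosition)
    {
        // Group win: +1 and end the episode for the whole group.
        m_AgentGroup.AddGroupReward(1f);
        m_AgentGroup.EndGroupEpisode();
    }
    else if (AnyAgentOutOfBounds() || m_stepCount >= MaxEnvironmentSteps)
    {
        // Group loss: -1, either out of bounds or out of time.
        m_AgentGroup.AddGroupReward(-1f);
        m_AgentGroup.EndGroupEpisode();
    }
}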

However, when I run training using the same config as above, the following results are observed:

I really can’t understand how this is happening. Is there something that I am missing here?

Thank you.