Trouble with learning via self-play in a simple 1v1 FPS setting

Hello.

First of all, let me thank the creators of ML-Agents; it appears to be a truly empowering and user-friendly toolkit. However, I haven’t been able to make use of it myself so far, as I’ve struggled to get my first project working.

Context: For a project in my master’s AI course, I decided to try training an agent in a simple FPS setting. I took the Pyramids area and modified it as follows:

The area is inhabited by two agents, who share the same behaviour (a rough sketch of the action mapping follows the list):

  • they can move forward/still/backward (action 2),
  • right/still/left (action 3),
  • rotate right/still/left (action 4),
  • pull/don’t pull the trigger (action 1, shooting is further restricted by the fire rate), and
  • apply/don’t apply a precision factor that reduces the move and rotation speeds (action 0).
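
To make the action space concrete, this is roughly how the branches get mapped in OnActionReceived (a simplified sketch, not the exact code; precisionFactor and the movement handling are illustrative):

// OnActionReceived on the Agent subclass (ActionBuffers lives in Unity.MLAgents.Actuators)
public override void OnActionReceived(ActionBuffers actions)
{
    var discrete = actions.DiscreteActions;

    bool precision = discrete[0] == 1;   // branch 0: apply the precision factor or not
    int trigger = discrete[1];           // branch 1: pull the trigger or not
    int moveZ = discrete[2];             // branch 2: forward / still / backward
    int moveX = discrete[3];             // branch 3: right / still / left
    int rotate = discrete[4];            // branch 4: rotate right / still / left

    float factor = precision ? precisionFactor : 1f;   // precisionFactor < 1 slows movement and rotation
    // ...apply movement and rotation scaled by factor, then handle the trigger
    // as in the per-step reward snippet below (triggerAction corresponds to trigger here)
}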

They use a camera sensor with 108x60 resolution and collect no other observations. The camera also displays a crosshair that changes its colour to red if the agent points towards the other. This is what they (should) see:
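
The camera observation itself is the standard CameraSensorComponent, configured in the Inspector; in code the equivalent would be roughly this (agentCamera is illustrative; colour is kept since the red crosshair is the only cue):

// Roughly the Inspector settings, expressed in code (Unity.MLAgents.Sensors)
var camSensor = gameObject.AddComponent<CameraSensorComponent>();
camSensor.Camera = agentCamera;   // per-agent camera that also renders the crosshair overlay
camSensor.Width = 108;
camSensor.Height = 60;
camSensor.Grayscale = false;      // keep colour so the red crosshair is visible to the policy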

Throughout the past week, I’ve tried a number of configurations. In the next comment (due to the max 5 images per comment limit), I will display TensorBoard graphs for the following runs:

Besides the configuration files, the runs also differ in the reward. Below, the commented-out part of the reward was used by the runs “dcultimate” (obviously not ultimate, though…), “dcg” and “dcsac”, while the uncommented part was used by the “dcx” run.

Per-step reward:

// Encourage seeking / staying on targets
// if (HasTargetsInSight()) AddReward(1f / MaxStep);

// Pull trigger
if (triggerAction == 1)
{
    // Discourage wastefulness
    // AddReward(-1f / MaxStep);
 
    // Shoot
    if (Time.time >= nextTimeToFire)
    {
        nextTimeToFire = Time.time + 1f / fireRate;
        ShootWeapon();
    }
}

// Could ELO be falling due to registering -1f/MaxStep as a loss instead of a draw if the episode ends without a victor?
AddReward(0f);
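
For context, HasTargetsInSight() (which also drives the crosshair colour) is a simple line-of-sight check, roughly along these lines (simplified, not the exact code; the tag and range are illustrative):

private bool HasTargetsInSight()
{
    // Ray from the agent's camera straight through the crosshair
    if (Physics.Raycast(agentCamera.transform.position, agentCamera.transform.forward,
                        out RaycastHit hit, sightRange))   // sightRange: illustrative field
    {
        return hit.collider.CompareTag("Agent");           // illustrative tag on the opponent
    }
    return false;
}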

Final reward:

public void ResolveHit(DCAgent winnerAgent, DCAgent loserAgent, float stepRatio)
{
    // winnerAgent.AddReward(2f - stepRatio);
    // loserAgent.AddReward(-2f + stepRatio);
    winnerAgent.SetReward(1f);
    loserAgent.SetReward(-1f);
    winnerAgent.EndEpisode();
    loserAgent.EndEpisode();
}
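
Regarding the ELO question in the per-step snippet above: as far as I understand, the self-play trainer reads the result off the final reward (the docs recommend +1 for a win, -1 for a loss and 0 for a draw), so with the per-step penalties enabled a timed-out episode ends on an accumulated negative reward and presumably gets recorded as a loss rather than a draw. One way to make a draw explicit would be something like the sketch below (the method and where it is called from are hypothetical; EpisodeInterrupted() exists in recent ML-Agents versions, otherwise EndEpisode() would do):

public void ResolveDraw(DCAgent agentA, DCAgent agentB)
{
    // Overwrite whatever per-step reward has accumulated so both results read as a draw
    agentA.SetReward(0f);
    agentB.SetReward(0f);
    // EpisodeInterrupted() marks the cut-off as a timeout (the value estimate is bootstrapped)
    // rather than a true terminal state
    agentA.EpisodeInterrupted();
    agentB.EpisodeInterrupted();
}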

For reference, this is how my demo’s metadata looks:


I realise that, by encouraging “target in sight” behaviour and adding other per-step rewards/penalties, I’m imposing a bias on the agents, but in this case I thought it was necessary: after millions of training steps, the agents still didn’t seem to recognise each other. However, that remained the case even after adding this incentive, hence my persisting problem…

Part 1/2

PPO graphs:

SAC graphs:

ELO graph:

Notes:

  • I don’t have an Nvidia GPU, so I’m forced to train on the CPU, which is why I’ve never come close to the total of 50M steps. The longest run took me 2 days.
  • I’m not sure what to think about SAC; it was hard to set up in the first place (it would regularly freeze the simulation for many minutes).

I’ve had a few thoughts about what I could be doing wrong:

  • Are my configurations or rewards simply inappropriate? Or is there some other mistake that I’ve missed?

  • Should I just have more faith that training will eventually start heading in the right direction? The thing is, I haven’t seen even a glimpse of promising behaviour after 6M steps…

  • Is the environment too complex, and should it be simplified?

  • Is the problem itself too hard? Maybe the win condition (see the target and shoot) is too close to the loss condition (be seen by the target and shot).

Anyway, I thought it was about time to consult some people more knowledgeable and experienced than me.

I thank you all in advance for your time and thoughts.

Part 2/2

Hi APDev,

Learning from raw images can take much more data than learning from vector observations. As such, 6 million steps from eight concurrent agents may be too few. It is also the case that denser rewards will help training, as long as they are properly scaled. On top of this, you have an adversarial setting, which makes training more complex. I would perhaps try getting a simple single-player version working with dense rewards, and work your way toward the more complex behavior.
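
For example, against a static target you could reward the agent every step for pointing closer to it, something along these lines (the names are illustrative):

// Inside OnActionReceived(), after applying the movement/rotation actions
Vector3 toTarget = (target.position - transform.position).normalized;
float alignment = Vector3.Dot(transform.forward, toTarget);   // 1 when aimed straight at the target
AddReward(alignment * 0.01f);                                  // small, dense shaping signal each step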


Now that you mention it, the few related examples from the literature that I’ve seen are all single-agent as well (e.g. [1609.05521] Playing FPS Games with Deep Reinforcement Learning or [1605.02097] ViZDoom: A Doom-based AI Research Platform for Visual Reinforcement Learning), so I probably should have taken the hint earlier.

Thank you, I will do as advised.