Hello.
First of all, let me thank the creators of ML-Agents; it appears to be a truly empowering and user-friendly toolkit. However, I haven’t been able to take advantage of it so far, as I’ve struggled to get my first project working.
Context: For a project in my master’s AI course, I decided to try training an agent in a simple FPS setting. I took the Pyramids area and modified it as follows:
The area is inhabited by two agents who share the same behaviour. Each has five discrete action branches (see the sketch after this list):
- move forward / stand still / move backward (branch 2),
- move right / stand still / move left (branch 3),
- rotate right / don’t rotate / rotate left (branch 4),
- pull / don’t pull the trigger (branch 1; shooting is further restricted by the fire rate), and
- apply / don’t apply a precision factor that reduces the move and rotation speeds (branch 0).
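To make the action layout concrete, here is a minimal sketch of how these branches could be read in OnActionReceived(); the class and field names and the movement maths are placeholders, not my actual code.

using Unity.MLAgents;
using Unity.MLAgents.Actuators;
using UnityEngine;

// Simplified sketch only: field names and movement maths are placeholders.
public class ShooterAgentSketch : Agent
{
    public float moveSpeed = 2f;
    public float turnSpeed = 120f;
    public float precisionFactor = 0.5f;

    public override void OnActionReceived(ActionBuffers actions)
    {
        var discrete = actions.DiscreteActions;

        // Branch 0: precision mode (1 = on) scales movement and rotation down.
        float scale = discrete[0] == 1 ? precisionFactor : 1f;

        // Branch 1: trigger; consumed by the shooting / per-step reward code shown further below.
        int triggerAction = discrete[1];

        // Branches 2-4: map the branch values {0, 1, 2} onto {-1, 0, +1}.
        float forward = discrete[2] - 1;
        float strafe = discrete[3] - 1;
        float turn = discrete[4] - 1;

        transform.position += (transform.forward * forward + transform.right * strafe)
                              * moveSpeed * scale * Time.deltaTime;
        transform.Rotate(0f, turn * turnSpeed * scale * Time.deltaTime, 0f);
    }
}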
They use a camera sensor with a 108x60 resolution and collect no other observations. The camera view also includes a crosshair that turns red when the agent points at the other agent. This is what they (should) see:
Throughout the past week, I’ve tried a number of configurations. In the next comment (due to the limit of five images per comment), I will post the TensorBoard graphs for the following:
Besides the configuration files, the runs also differ in the reward. Below, the commented-out part of the reward was used by the runs “dcultimate” (obviously not ultimate, though…), “dcg” and “dcsac”, while the uncommented part was used by the “dcx” run.
Per-step reward:
// Encourage seeking / staying on targets
// if (HasTargetsInSight()) AddReward(1f / MaxStep);

// Pull trigger
if (triggerAction == 1)
{
    // Discourage wastefulness
    // AddReward(-1f / MaxStep);

    // Shoot
    if (Time.time >= nextTimeToFire)
    {
        nextTimeToFire = Time.time + 1f / fireRate;
        ShootWeapon();
    }
}

// Could ELO be falling due to registering -1f/MaxStep as a loss instead of a draw if the episode ends without a victor?
AddReward(0f);
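In case it helps, a check like HasTargetsInSight() (and the crosshair colour change) boils down to a raycast along the camera’s forward direction; the sketch below is simplified, with placeholder range and tag values rather than my exact setup.

using UnityEngine;

// Simplified line-of-sight sketch; sightRange and the tag are placeholders.
public class TargetSightSketch : MonoBehaviour
{
    public Camera agentCamera;
    public float sightRange = 50f;

    public bool HasTargetsInSight()
    {
        var ray = new Ray(agentCamera.transform.position, agentCamera.transform.forward);
        if (Physics.Raycast(ray, out RaycastHit hit, sightRange))
        {
            // The same hit result can be used to turn the crosshair red.
            return hit.collider.CompareTag("Agent");
        }
        return false;
    }
}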
Final reward:
public void ResolveHit(DCAgent winnerAgent, DCAgent loserAgent, float stepRatio)
{
    // winnerAgent.AddReward(2f - stepRatio);
    // loserAgent.AddReward(-2f + stepRatio);
    winnerAgent.SetReward(1f);
    loserAgent.SetReward(-1f);
    winnerAgent.EndEpisode();
    loserAgent.EndEpisode();
}
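Related to the ELO question in the per-step reward above: if an episode reaches MaxStep while an agent has accumulated a non-zero reward, I suspect the self-play ELO update may score it as a win/loss rather than a draw. Below is a minimal sketch of resolving a timeout explicitly as a draw; the method name is a placeholder, and I haven’t verified that this is what fixes the ELO drift.

// Placeholder sketch: explicitly resolve a timed-out episode as a draw,
// so neither agent carries a residual per-step penalty into the final reward.
public void ResolveDraw(DCAgent agentA, DCAgent agentB)
{
    agentA.SetReward(0f);
    agentB.SetReward(0f);
    agentA.EndEpisode();
    agentB.EndEpisode();
}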
For reference, this is what my demo’s metadata looks like:
I realise that, by encouraging “target in sight” behaviour and adding other per-step rewards/penalties, I’m imposing a shaping bias on the agents, but in this case I thought it was necessary: after millions of training steps, the agents still seemed not to recognise each other. However, this was still the case even after adding that incentive, hence my persisting problem…
Part 1/2