Unity 2020.3.3f / ml-agents 0.23.0 / communicator 1.3.0 / PyTorch 1.7.0 on macOS Big Sur 11.2.3
I’ve been excitedly playing with ML-Agents on and off for about a month now. I followed some of the basic tutorials on YouTube and have been trying to find ANY reading material I can on the topic. Unfortunately, most of the search results point to the same few threads, so I thought I’d ask my question here.
I have a basic training environment set up. It’s a small plane surrounded by 4 walls. The agent has a RayPerceptionSensor3D mounted at about mid height. It casts rays all around the agent and only detects two tags: “Obstacle” and “Goal”. The sensor can also detect the layers “Default” and “TerrainLayer”. The walls are on the “TerrainLayer” and the goal object is on the “Default” layer. I have confirmed that the RayPerceptionSensor3D does detect the walls and the goal, but I’m still fairly fuzzy on how the agent uses those readings and how to strengthen the associations.
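To make the size of the ray input concrete, here is a minimal sketch of the observation-size formula given in the ML-Agents documentation for ray perception sensors: (observation stacks) * (1 + 2 * rays per direction) * (detectable tags + 2). The class and method names are hypothetical, and the rays-per-direction value in the usage note is only an example, not my actual setting.

```csharp
using System;

// Hypothetical helper for illustration only; not part of the agent script.
public static class RaySensorMath
{
    // Number of floats one ray perception sensor contributes per decision.
    public static int ObservationSize(int stacks, int raysPerDirection, int detectableTags)
    {
        if (stacks < 1 || raysPerDirection < 0 || detectableTags < 0)
            throw new ArgumentException("stacks must be >= 1; the other values must be >= 0");

        int rayCount = 1 + 2 * raysPerDirection;   // centre ray plus rays on each side
        int floatsPerRay = detectableTags + 2;     // one-hot tag + "hit nothing" flag + hit distance
        return stacks * rayCount * floatsPerRay;
    }
}
```

With my two tags (“Obstacle” and “Goal”), one stack, and an example value of 3 rays per direction, `RaySensorMath.ObservationSize(1, 3, 2)` gives 28 floats, compared with the 9 floats from the vector observations.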
The agent also has 9 observations of its own:
- Normalized agent position (x,y,z)
- Normalized goal position (x,y,z)
- Agent Forward (x,y,z)

My agent script detects when the Agent hangs out against a wall for too long and ends the episode with a negative reward.
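For reference, the timer in the agent script below adds Time.fixedDeltaTime on every FixedUpdate, so its 75f threshold is measured in seconds of game time. A minimal sketch converting that threshold into physics steps, assuming Unity's default fixed timestep of 0.02 s (I have not changed it, but treat that as an assumption); the helper name is hypothetical.

```csharp
using System;

// Hypothetical helper for illustration only; not part of the agent script.
public static class WallTimerMath
{
    // Approximate number of FixedUpdate calls before an accumulated
    // fixedDeltaTime timer reaches the given threshold in seconds.
    public static double StepsUntilTimeout(double thresholdSeconds, double fixedDeltaTime)
    {
        if (fixedDeltaTime <= 0.0)
            throw new ArgumentException("fixedDeltaTime must be positive");
        return thresholdSeconds / fixedDeltaTime;
    }
}
```

`WallTimerMath.StepsUntilTimeout(75.0, 0.02)` comes out at about 3750 physics steps of continuous wall contact before the episode is ended.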
For the first 40 attempts, the agent needs to move from its spawn to the goal. It receives a +1 reward for touching the goal and an additive -0.075 reward while it’s touching any wall. After 40 attempts, the agent gets pretty good at this, so I add walls by switching the environment. The environment size stays the same, but 3 walls are placed inside it, also on the “TerrainLayer” with the tag “Obstacle”.
Once the walls are added, the agent does REALLY well if the goal is within immediate sight of its RayPerceptionSensor3D. However, when the agent is on one side of a wall and the goal is on the other side, it just keeps trying to move into the wall and grinds against it until I end the episode. I would expect odd behaviour like this while it figures things out, but this seems to be all it does. Occasionally it even fails the simple environment (no walls) because it moves into a corner, locks onto the wall, and takes the negative reward until it’s respawned.
I did have one training run that ran overnight (about 6 million steps) and resulted in a good cumulative reward, but when I ran that brain through the training environment I observed something similar: sometimes the agent would go right for the goal, other times it would seemingly give up and run itself into walls.
I’m not quite sure what I’m doing wrong here and, as stated previously, the limited threads on this fairly new topic make finding answers difficult. I’ve tried tweaking things such as the length of the rays. Initially, the rays were long enough to reach every side of the room, which I strongly believe confused the AI into thinking it had limited moves.
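To put the two reward sizes side by side, here is a minimal sketch of the arithmetic. It assumes a plain undiscounted sum in which each wall penalty is applied once per contact event; it ignores discounting and the fact that SetReward replaces the current step's reward. The class and parameter names are hypothetical.

```csharp
using System;

// Hypothetical helper for illustration only; not part of the agent script.
public static class RewardBudget
{
    // Undiscounted episode total for a given number of wall penalty events.
    public static double EpisodeReturn(int wallPenaltyEvents, bool reachedGoal,
                                       double wallPenalty = -0.075, double goalReward = 1.0)
    {
        if (wallPenaltyEvents < 0)
            throw new ArgumentException("wallPenaltyEvents must be >= 0");
        return wallPenaltyEvents * wallPenalty + (reachedGoal ? goalReward : 0.0);
    }

    // Smallest number of penalty events whose sum outweighs the goal reward.
    public static int PenaltiesToCancelGoal(double goalReward = 1.0, double wallPenalty = -0.075)
    {
        if (wallPenalty == 0.0)
            throw new ArgumentException("wallPenalty must be non-zero");
        return (int)Math.Ceiling(goalReward / Math.Abs(wallPenalty));
    }
}
```

Under those assumptions, `RewardBudget.PenaltiesToCancelGoal()` is 14, so an episode with 14 wall penalties still ends slightly negative even when the agent reaches the goal.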
I tried shortening the rays significantly hoping it would push the AI away when it gets close, but it just seems to latch onto the wall once it sees it.
A few days before posting this thread, I found I was normalizing the coordinates incorrectly and thought fixing that would be it. Now I calculate the bounds of the level and normalize the coordinates against those bounds. From my debug view the values are correct, but I’m not sure it made much of a difference.
I have, of course, tried tuning hyperparameters, but that doesn’t seem to make a huge difference in the training.
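The normalization is a per-axis min-max mapping of a world position into the level bounds. Here is a minimal standalone sketch of it using System.Numerics instead of UnityEngine so it runs outside Unity. The zero-extent guard and the clamp to [0, 1] are additions for illustration and are not in my actual script; the class name is hypothetical.

```csharp
using System.Numerics;

// Hypothetical helper for illustration only; not part of the agent script.
public static class BoundsNormalizer
{
    // Maps value from [min, max] to [0, 1]. Returns 0 when the axis has no
    // extent (max <= min), which avoids a division by zero.
    public static float NormalizeAxis(float value, float min, float max)
    {
        float extent = max - min;
        if (extent <= 0f) return 0f;
        float t = (value - min) / extent;
        if (t < 0f) return 0f;   // clamp positions outside the bounds
        if (t > 1f) return 1f;
        return t;
    }

    public static Vector3 Normalize(Vector3 input, Vector3 min, Vector3 max)
    {
        return new Vector3(
            NormalizeAxis(input.X, min.X, max.X),
            NormalizeAxis(input.Y, min.Y, max.Y),
            NormalizeAxis(input.Z, min.Z, max.Z));
    }
}
```

For example, with bounds from (-10, 0, -10) to (10, 5, 10), the position (0, 2.5, 5) maps to (0.5, 0.5, 0.75).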
To me, it seems like the AI is not understanding that hugging walls is being negatively reinforced. I hesitate to make the negative rewards too strong (< -1) since that could potentially screw with AI normalization?
I had a small negative reward every step, but removed it because I thought it was detrimental to training around the walls.
Hyperparameters:
behaviors:
  FindExit:
    framework: pytorch
    trainer_type: ppo
    hyperparameters:
      batch_size: 128
      buffer_size: 2048
      # batch_size: 4096
      # buffer_size: 10240
      learning_rate: 3.0e-4
      beta: 5.0e-4
      epsilon: 0.2
      lambd: 0.99
      num_epoch: 3
      learning_rate_schedule: linear
    network_settings:
      normalize: false
      hidden_units: 128
      num_layers: 3
      vis_encode_type: simple
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
    keep_checkpoints: 5
    max_steps: 6.0e6
    time_horizon: 64
    summary_freq: 12000
    threaded: true
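As a sanity check on gamma: 0.99 from the config above, here is a minimal sketch of how much a +1 goal reward is worth once it is discounted over a number of decisions. The step counts in the usage note are illustrative only (64 is just my time_horizon value); I have not measured how many decisions a path around a wall actually takes. The class name is hypothetical.

```csharp
using System;

// Hypothetical helper for illustration only; not part of the agent script.
public static class DiscountMath
{
    // Present value of a reward received stepsAway decisions in the future,
    // using the standard exponential discount gamma^stepsAway.
    public static double DiscountedValue(double reward, double gamma, int stepsAway)
    {
        if (gamma < 0.0 || gamma > 1.0)
            throw new ArgumentException("gamma must be in [0, 1]");
        if (stepsAway < 0)
            throw new ArgumentException("stepsAway must be >= 0");
        return reward * Math.Pow(gamma, stepsAway);
    }
}
```

`DiscountMath.DiscountedValue(1.0, 0.99, 64)` is a little over half of the original reward, and the value keeps shrinking as the goal gets further away in decisions.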
Agent:
using UnityEngine;
using UnityEngine.Events;
using Unity.MLAgents;
using Unity.MLAgents.Actuators;
using Unity.MLAgents.Sensors;
// Assumption: ThirdPersonCharacter is the Standard Assets controller.
using UnityStandardAssets.Characters.ThirdPerson;

public class MyAgent : Agent
{
    private Vector3 _SpawnPoint;

    [Header("References")]
    private ThirdPersonCharacter _thirdPersonController;
    [SerializeField] BoundsCollector _LevelBounds;
    [SerializeField] Transform ExtractionPoint;
    [SerializeField] Testing_FindRandomPosition Training_RandomizeBoundsContainer;

    [Header("Events")]
    [SerializeField] UnityEvent _OnEpisodeBegin;
    [SerializeField] UnityEvent _OnEpisodePass, _OnEpisodeFail;

    [Header("Properties")]
    [SerializeField] private bool _IsTouchingWall = false;
    [SerializeField] float _TimeTouchingWall = 0f;
    [SerializeField] float MoveSpeed = 2f;

    [Header("Observations")]
    [Tooltip("Now Normalized Extraction Point coordinate")]
    [SerializeField] public Vector3 NormalizedDistanceFromGoal;
    [SerializeField] public Vector3 NormalizedPlayerPosition;

    [Header("Input")]
    [SerializeField] private Vector3 _MovementVector3 = Vector3.zero;
    private bool jump = false;
    private bool fullyGrounded = false;

    public override void Initialize()
    {
        base.Initialize();
        _thirdPersonController = GetComponent<ThirdPersonCharacter>();
        _SpawnPoint = transform.position;
    }

    private void Respawn()
    {
        transform.position = _SpawnPoint;
    }

    public override void OnEpisodeBegin()
    {
        base.OnEpisodeBegin();
        _MovementVector3 = Vector3.zero;
        _TimeTouchingWall = 0f;
        _OnEpisodeBegin?.Invoke();
        transform.position = _SpawnPoint;
    }

    // Manual control for testing: one discrete branch, 0 = idle.
    public override void Heuristic(in ActionBuffers actionsOut)
    {
        var discreteOut = actionsOut.DiscreteActions;
        if (Input.GetKey(KeyCode.A)) discreteOut[0] = 1; // L
        else if (Input.GetKey(KeyCode.D)) discreteOut[0] = 2; // R
        if (Input.GetKey(KeyCode.W)) discreteOut[0] = 3; // Up/forward
        else if (Input.GetKey(KeyCode.S)) discreteOut[0] = 4; // down/backward
    }

    public override void OnActionReceived(ActionBuffers actionsOut)
    {
        var discreteActions = actionsOut.DiscreteActions;
        switch ((int)discreteActions[0])
        {
            case 1: // L
                _MovementVector3 = MoveSpeed * Vector3.left;
                break;
            case 2: // R
                _MovementVector3 = MoveSpeed * Vector3.right;
                break;
            case 3: // Up
                _MovementVector3 = MoveSpeed * Vector3.forward;
                break;
            case 4: // down
                _MovementVector3 = MoveSpeed * Vector3.back;
                break;
            case 0:
                _MovementVector3 = Vector3.zero;
                break;
        }
        _thirdPersonController.Move(_MovementVector3, false, jump);
        jump = false;

        // End the episode with a penalty if the agent has been
        // against a wall for too long.
        if (_IsTouchingWall
            && _TimeTouchingWall > 75f)
        {
            _IsTouchingWall = false;
            _TimeTouchingWall = 0f;
            SetReward(-.1f);
            EndEpisode();
            _OnEpisodeFail?.Invoke();
        }
    }

    private void FixedUpdate()
    {
        // Accumulate wall contact time in seconds.
        if (_IsTouchingWall)
        {
            _TimeTouchingWall += 1.0f * Time.fixedDeltaTime;
        }
        // Fell off the level.
        if (transform.position.y < -10f)
        {
            SetReward(-0.5f);
            EndEpisode();
        }
    }

    private void OnCollisionExit(Collision collision) => OnCollisionTriggerExit(collision.collider);
    private void OnTriggerExit(Collider other) => OnCollisionTriggerExit(other);
    private void OnCollisionEnter(Collision collision) => OnCollisionTriggerEnterStay(collision.collider);
    private void OnTriggerEnter(Collider other) => OnCollisionTriggerEnterStay(other);

    private void OnCollisionTriggerExit(Collider other)
    {
        if ((other.gameObject.tag == "Water"
            || other.gameObject.tag == "Obstacle") && _IsTouchingWall)
        {
            _IsTouchingWall = false;
        }
    }

    // Called from OnCollisionEnter and OnTriggerEnter.
    private void OnCollisionTriggerEnterStay(Collider other)
    {
        if (other.gameObject.tag == "Goal")
        {
            SetReward(1f);
            EndEpisode();
            _OnEpisodePass?.Invoke();
        }
        if (other.gameObject.tag == "Water"
            || other.gameObject.tag == "Obstacle")
        {
            if (_IsTouchingWall == false) _IsTouchingWall = true;
            AddReward(-.075f);
            //EndEpisode();
            //_OnEpisodeFail?.Invoke();
        }
    }

    // Min-max normalization of a world position against the level bounds.
    private Vector3 NormalizePositions(Vector3 input, Vector3 min, Vector3 max, out Vector3 vec)
    {
        vec.x = (input.x - min.x) / (max.x - min.x);
        vec.y = (input.y - min.y) / (max.y - min.y);
        vec.z = (input.z - min.z) / (max.z - min.z);
        return vec;
    }

    public override void CollectObservations(VectorSensor sensor)
    {
        // the bounds are determined on the fly when level starts,
        // so this is a simple getter.
        var b = _LevelBounds.GetGroupedBounds;

        /// OBSERVATION #1 - 3x float (x, y, z)
        if (ExtractionPoint != null)
        {
            NormalizePositions(ExtractionPoint.position, b.min, b.max, out NormalizedDistanceFromGoal);
            sensor.AddObservation(NormalizedDistanceFromGoal);
        }
        else
        {
            Debug.Log($"My extraction point is null.", gameObject);
            sensor.AddObservation(Vector3.zero);
        }

        NormalizePositions(transform.position, b.min, b.max, out NormalizedPlayerPosition);
        NormalizedPlayerPosition.y += .28f;
        // 3x float (x,y,z)
        sensor.AddObservation(NormalizedPlayerPosition);
        // 3x float (x,y,z)
        sensor.AddObservation(transform.forward);
        //AddReward(-0.00006f);
    }
}
As is, if the AI happens to spawn close to the goal cube and can see it, it goes right for it. Other than that, it seems to be “no thoughts, head empty” when it comes to interpreting the goal’s normalized position.
If there’s any other information you guys need, let me know. Thank you in advance for your help with figuring this new & exciting technology out!




