This is my first attempt at training an agent, so I wanted to do something simple: train a capsule-shaped agent to balance itself so that it doesn’t tip over and fall.
Here is the agent script I wrote; please forgive me if there are any eye-gouging coding blunders. I’m taking the x and z rotations and velocities as observations, and the agent can add force to move itself. The magic number 0.2588191 is simply the quaternion component corresponding to a 30-degree rotation. So if the agent falls off the edge or tips more than 30 degrees, it gets a reward of -1 and the episode ends. Each step the agent also receives a reward of 0.01.
using System.Collections.Generic;
using UnityEngine;
using Unity.MLAgents;
using Unity.MLAgents.Sensors;

public class BalancerAgent : Agent
{
    Rigidbody rBody;

    void Start()
    {
        rBody = GetComponent<Rigidbody>();
    }

    public override void OnEpisodeBegin()
    {
        if (Mathf.Abs(this.transform.rotation.x) > 0.2588191f || Mathf.Abs(this.transform.rotation.z) > 0.2588191f || this.transform.position.y < 0)
        {
            this.rBody.angularVelocity = Vector3.zero;
            this.rBody.velocity = Vector3.zero;
            var rotationVector = transform.rotation.eulerAngles;
            rotationVector.x = 1;
            rotationVector.z = 0;
            this.transform.rotation = Quaternion.Euler(rotationVector);
            this.transform.localPosition = new Vector3(0, 1, 0);
        }
    }

    public override void CollectObservations(VectorSensor sensor)
    {
        // Agent velocity
        sensor.AddObservation(rBody.velocity.x);
        sensor.AddObservation(rBody.velocity.z);
        // Agent rotation
        sensor.AddObservation(this.transform.rotation.x);
        sensor.AddObservation(this.transform.rotation.z);
    }

    public float forceMultiplier = 10;

    public override void OnActionReceived(float[] vectorAction)
    {
        Vector3 controlSignal = Vector3.zero;
        controlSignal.x = vectorAction[0];
        controlSignal.z = vectorAction[1];
        rBody.AddForce(controlSignal * forceMultiplier);

        // Fell off platform
        if (this.transform.localPosition.y < 0)
        {
            SetReward(-1f);
            EndEpisode();
        }
        // If fell over
        if (Mathf.Abs(this.transform.rotation.x) > 0.2588191f || Mathf.Abs(this.transform.rotation.z) > 0.2588191f)
        {
            SetReward(-1f);
            EndEpisode();
        }
        SetReward(0.01f);
    }

    public override void Heuristic(float[] actionsOut)
    {
        actionsOut[0] = Input.GetAxis("Horizontal");
        actionsOut[1] = Input.GetAxis("Vertical");
    }
}
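As an aside, 0.2588191 is just the x (or z) component of the quaternion for a 30-degree rotation about a single axis, i.e. sin(15°); a quick way to check that in Unity:
// The quaternion of a 30-degree rotation about X has x = sin(15 degrees)
float threshold = Quaternion.Euler(30f, 0f, 0f).x;
Debug.Log(threshold); // prints roughly 0.2588190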
I’ve played around with various batch_size, buffer_size, time_horizon, and learning_rate values, but the highest reward achieved was around 12 (currently around 3-4 with these values), and it would just bounce up and down from there without really making progress.
I’ve spent a good few hours trying to solve this issue, and it’s been a great learning experience, but at this point I just want to figure out what the deal is! Is the issue with my script or with the hyperparameters?
If a similar issue happens in the future, where an agent gets stuck at a certain reward level, should I assume it’s the hyperparameters that need tweaking, or are there other common causes?
Thank you so much for reading this; I hope you can help me out here.
Edit:
In Unity I’m using a max step of 4000 for each agent, a vector observation space size of 4, an action space size of 2, and a decision period of 3.
Hi @MrpHDanny,
The AddForce function adds force in world-space coordinates. It may benefit you to try AddRelativeForce in order to add force relative to the Rigidbody’s own transform.
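For example, a minimal sketch of the one-line change (assuming the rest of OnActionReceived stays the same):
// Apply the control signal in the capsule's local space instead of world space
rBody.AddRelativeForce(controlSignal * forceMultiplier);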
Hi @christophergoy,
Thanks for the suggestion. I’ve changed it to AddRelativeForce, but it’s had no effect on the training. Running the training again yields these results:
2020-12-22 19:40:53 INFO [stats.py:139] Balancing. Step: 12000. Time Elapsed: 20.238 s. Mean Reward: -0.837. Std of Reward: 0.046. Training.
2020-12-22 19:41:06 INFO [stats.py:139] Balancing. Step: 24000. Time Elapsed: 33.526 s. Mean Reward: -0.830. Std of Reward: 0.053. Training.
2020-12-22 19:41:19 INFO [stats.py:139] Balancing. Step: 36000. Time Elapsed: 45.982 s. Mean Reward: -0.833. Std of Reward: 0.049. Training.
2020-12-22 19:41:32 INFO [stats.py:139] Balancing. Step: 48000. Time Elapsed: 59.518 s. Mean Reward: -0.835. Std of Reward: 0.051. Training.
2020-12-22 19:41:45 INFO [stats.py:139] Balancing. Step: 60000. Time Elapsed: 72.316 s. Mean Reward: -0.829. Std of Reward: 0.051. Training.
2020-12-22 19:41:59 INFO [stats.py:139] Balancing. Step: 72000. Time Elapsed: 86.083 s. Mean Reward: -0.825. Std of Reward: 0.052. Training.
2020-12-22 19:42:12 INFO [stats.py:139] Balancing. Step: 84000. Time Elapsed: 98.577 s. Mean Reward: -0.821. Std of Reward: 0.055. Training.
2020-12-22 19:42:25 INFO [stats.py:139] Balancing. Step: 96000. Time Elapsed: 111.572 s. Mean Reward: -0.819. Std of Reward: 0.055. Training.
2020-12-22 19:42:37 INFO [stats.py:139] Balancing. Step: 108000. Time Elapsed: 123.668 s. Mean Reward: -0.828. Std of Reward: 0.052. Training.
2020-12-22 19:42:49 INFO [stats.py:139] Balancing. Step: 120000. Time Elapsed: 136.171 s. Mean Reward: -0.830. Std of Reward: 0.051. Training.
2020-12-22 19:43:03 INFO [stats.py:139] Balancing. Step: 132000. Time Elapsed: 149.995 s. Mean Reward: -0.822. Std of Reward: 0.053. Training.
2020-12-22 19:43:15 INFO [stats.py:139] Balancing. Step: 144000. Time Elapsed: 162.253 s. Mean Reward: -0.822. Std of Reward: 0.057. Training.
2020-12-22 19:43:28 INFO [stats.py:139] Balancing. Step: 156000. Time Elapsed: 174.673 s. Mean Reward: -0.829. Std of Reward: 0.056. Training.
2020-12-22 19:43:40 INFO [stats.py:139] Balancing. Step: 168000. Time Elapsed: 187.309 s. Mean Reward: -0.828. Std of Reward: 0.050. Training.
What does your environment look like? Is the capsule on a platform?
Is there a reason you are reading the x and z values directly from the Quaternion instead of converting them to Euler angles? It seems like it might be better to use Euler angles for these checks, since the raw quaternion components don’t correspond directly to degrees and can change in unintuitive ways (they live in an imaginary-number space used to represent rotations).
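Something along these lines, just as a sketch (note that eulerAngles comes back in the 0-360 range, so the tilt has to be re-centred around zero before comparing against 30 degrees):
// Tilt in degrees around X and Z, wrapped into the -180..180 range
float tiltX = Mathf.DeltaAngle(0f, transform.eulerAngles.x);
float tiltZ = Mathf.DeltaAngle(0f, transform.eulerAngles.z);
if (Mathf.Abs(tiltX) > 30f || Mathf.Abs(tiltZ) > 30f)
{
    SetReward(-1f);
    EndEpisode();
}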
Yes, the scene is simply a flat platform with the agent in the middle of it. No changes are made to the environment.
Instead of the magic number I used before, I now use Vector3.Angle to calculate how far the agent has tilted. The check is now:
float angle = Vector3.Angle(Vector3.up, transform.up);
// If fell over
if (angle > 30)
{
    SetReward(-1f);
    EndEpisode();
}
SetReward(0.01f);
How many agents are you learning on and how many instances of the environment are you running?
Your batch and buffer sizes are way too small if you are collecting experience from something like n agents times p environment instances.
Crank that learning rate up; your std of reward is tiny. The reward should be bouncing around enough to get the agent out of local maxima (but not so erratic that it can’t learn).
Add some hidden layers; it’s easier to set the network size high and assume you might be over-fitting than it is to guess what’s going wrong when the agent won’t train at all (which it won’t with too small a network).
Reward Sculpting -
Your reward scheme looks really harsh to me. When agents are over-punished they can get stuck trying to minimize the punishment instead of optimizing for the rewards, which can lead to behaviors like deliberately resetting themselves.
At first I try to only give the -1 and reset from unrecoverable positions. You can always tune it to reset more often later to speed up training.
You are always rewarding the agent +0.01 any time it isn’t being reset. This doesn’t give the agent any feedback about which actions are ‘more right’ than others. Try rewarding the agent more when it is standing straight up and less (while still giving some reward) when it is near the tipping point or about to be reset, as in the sketch below.
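A minimal sketch of that kind of shaped reward, reusing the Vector3.Angle tilt from your updated check (the 0.01 scale and the linear falloff are just assumptions to tune):
// Shaped per-step reward: 0.01 when perfectly upright, tapering linearly
// to 0 as the tilt approaches the 30-degree reset threshold.
float tilt = Vector3.Angle(Vector3.up, transform.up);   // 0 = upright
float uprightness = Mathf.Clamp01(1f - tilt / 30f);     // 1 when upright, 0 at 30 degrees
AddReward(0.01f * uprightness);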
Your max step per agent is large: with the default fixed timestep of 0.02 (50 steps per second), 4000 steps means you are expecting the agent to stand for 80 seconds. I’d shorten that to something more attainable and reward the agent at the end of the episode for making it. You can always add more steps once you know it’s progressing.
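One way that end-of-episode bonus could look inside OnActionReceived, as a sketch using the Agent’s MaxStep and StepCount (the +1 value and a shortened MaxStep of, say, 1000 are just assumptions):
// Bonus for surviving the whole (shortened) episode without being reset
if (StepCount >= MaxStep - 1)
{
    AddReward(1f);   // made it to the end of the episode
    EndEpisode();
}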