I am trying to train a drone to chase another drone for a combat simulator (ztag.com is our real-life game we're trying to simulate). I've modeled much of the ML code on the Hummingbird example (in a given space, the agent finds the nearest position and flies to it), except that in my case the target location is moving in 3D.
I started by randomly generating start locations for both the agent and target drones near the center of the course, and I gradually increase the spawn diameter as rewarded activity occurs. The agent uses raycasts to sense its surroundings, plus the following vector observations (a rough sketch of how I collect them follows the list):
- local rotation (4 observations)
- normalized vector pointing to target (3 observations)
- dot product showing whether the agent is pointing at the target (1 obs)
- dot product showing whether the agent's orientation is parallel to the target's (1 obs)
- normalized distance to the other drone (1 obs). I found this may actually get up to 2 instead of 1... (perhaps a source of problems?)
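For reference, here is roughly how those observations get collected. This is a simplified sketch rather than my exact script: the class name, the target field, and maxChaseDistance (the divisor I use to normalize distance) are placeholders.

using Unity.MLAgents;
using Unity.MLAgents.Sensors;
using UnityEngine;

public class ChaseAgent : Agent
{
    public Transform target;              // the target drone (placeholder name)
    public float maxChaseDistance = 50f;  // divisor used to normalize distance (placeholder value)

    public override void CollectObservations(VectorSensor sensor)
    {
        // Local rotation of the agent (4 observations)
        sensor.AddObservation(transform.localRotation.normalized);

        // Normalized vector pointing to the target (3 observations)
        Vector3 toTarget = target.position - transform.position;
        sensor.AddObservation(toTarget.normalized);

        // Dot product: is the agent pointing at the target? (1 observation)
        sensor.AddObservation(Vector3.Dot(transform.forward, toTarget.normalized));

        // Dot product: is the agent's orientation parallel to the target's? (1 observation)
        sensor.AddObservation(Vector3.Dot(transform.forward, target.forward));

        // Normalized distance to the other drone (1 observation).
        // If maxChaseDistance is the course radius rather than its diameter,
        // this value can exceed 1 (possibly the ">1" issue noted above).
        sensor.AddObservation(toTarget.magnitude / maxChaseDistance);
    }
}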
The rewards are (rough sketch after the list):
- -1 for bumping into obstacles
- +0.1 for each timestep the target drone's rear is within the agent's collision cone (about 15 m long with a 45-degree spread forward)
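For context, the reward logic looks roughly like the sketch below. Again this is simplified: the class name, the target field, the "Obstacle" tag, and whether the per-step reward is applied in FixedUpdate or OnActionReceived are placeholders rather than my exact code.

using Unity.MLAgents;
using UnityEngine;

public class ChaseRewardSketch : Agent
{
    public Transform target;                   // the target drone (placeholder name)
    private const float tagDistance = 15f;     // length of the tag cone
    private const float tagHalfAngle = 22.5f;  // 45-degree total spread, so 22.5 per side

    private void FixedUpdate()
    {
        Vector3 toTarget = target.position - transform.position;

        // Target is close enough and inside the agent's forward cone.
        bool inCone = toTarget.magnitude <= tagDistance
                      && Vector3.Angle(transform.forward, toTarget) <= tagHalfAngle;

        // Agent is behind the target, i.e. looking at its rear.
        bool behindTarget = Vector3.Dot(target.forward, toTarget.normalized) > 0f;

        if (inCone && behindTarget)
        {
            AddReward(0.1f); // +0.1 per step while the target's rear is in the cone
        }
    }

    private void OnCollisionEnter(Collision collision)
    {
        if (collision.gameObject.CompareTag("Obstacle")) // "Obstacle" is a placeholder tag
        {
            AddReward(-1f); // -1 for bumping into obstacles
        }
    }
}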
The agent starts out learning rapidly: it stops running into obstacles in as little as 250k steps. But after it starts succeeding in tagging the other drone (by getting within 15 meters directly behind the target), we see a gradual fall-off in performance. The trend seems to be that more data won't help at this point.
I admit I'm quite new to ML-Agents, so I certainly don't understand all the parameters I can tweak; I simply took what worked for the Hummingbird project and applied it here.
Here’s how I choose a new waypoint for the target drone to fly to:
public void chooseRandomWayPoint()
{
    Debug.Log("New waypoint");

    // Reset the timeout before the next waypoint gets chosen.
    newWayPointTimeOut = 15f;

    // Pick a point around the environment's center: the horizontal spread is
    // +/- difficultyDistance, and the altitude band around 25 widens with difficulty.
    currentWaypoint = Environment.transform.position +
        new Vector3(
            Random.Range(-difficultyDistance, difficultyDistance),
            Random.Range(25f - difficultyDistance / 5f, 25f + difficultyDistance / 5f),
            Random.Range(-difficultyDistance, difficultyDistance));
}
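For completeness, the 15-second timeout is what forces the target to keep moving. Roughly, something like this drives it (the reached-waypoint check and waypointReachedRadius are illustrative; my actual trigger logic may differ slightly):

// On the target drone's controller, alongside chooseRandomWayPoint().
// waypointReachedRadius is a placeholder name.
private void Update()
{
    newWayPointTimeOut -= Time.deltaTime;

    bool reachedWaypoint =
        Vector3.Distance(transform.position, currentWaypoint) < waypointReachedRadius;

    // Pick a new waypoint when the timeout expires or the current one is reached.
    if (newWayPointTimeOut <= 0f || reachedWaypoint)
    {
        chooseRandomWayPoint();
    }
}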
Here’s my YAML:
behaviors:
  ChaseGoal:
    trainer_type: ppo
    hyperparameters:
      # Hyperparameters common to PPO and SAC
      batch_size: 2048
      buffer_size: 20480
      learning_rate: 3.0e-4
      learning_rate_schedule: linear
      # PPO-specific hyperparameters
      beta: 5.0e-3
      epsilon: 0.2
      lambd: 0.95
      num_epoch: 3
    # Configuration of the neural network (common to PPO/SAC)
    network_settings:
      vis_encode_type: simple
      normalize: false
      hidden_units: 256
      num_layers: 2
      # memory
      memory:
        sequence_length: 64
        memory_size: 256
    # Trainer configurations common to all trainers
    max_steps: 5.0e6
    time_horizon: 128
    summary_freq: 10000
Questions:
- What am I doing wrong?
- Is there a way to resume training from the point where performance peaked in the last run?
Thanks for any advice!!
-Quan