Hi everyone,
I’m working on a project in which an agent learns to park in a specific spot. The problem is that the agent never gets into the spot; it keeps moving forward and backward, stuck in a local optimum.
Here is the environment: (the white slot is the target)
The agent’s position is generated randomly (among the available places) at the beginning of each episode.
I use 20 training areas in the scene, so 20 agents are learning in parallel.
Here is my configuration:
Observations (see the sketch after this list):
- transform.localPosition.x; (normalized in [0, 1])
- transform.localPosition.z; (normalized in [0, 1])
- transform.localRotation.y; (normalized in [0, 1])
- Mathf.Abs(rBody.velocity.x * 3.6f / 100);
- Mathf.Abs(rBody.velocity.z * 3.6f / 100);
- transform.InverseTransformPoint(target.transform.position).x;
- transform.InverseTransformPoint(target.transform.position).z;
- angular difference (y axis) between the car and the parking slot, so that 1 means parallel and 0 means perpendicular;
- RayPerceptionSensor3D with 6 rays per direction, 180°, detecting obstacles and other cars;
- RayPerceptionSensor3D with 5 rays per direction, 180°, detecting sidewalks and parking slot.
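Roughly, this is how those observations are collected (a simplified sketch of my agent class; the normalization bound and the AngleAlignment01 helper are just illustrative, and the two ray sensors are added as components in the Inspector, not in code):

```csharp
using Unity.MLAgents;
using Unity.MLAgents.Sensors;
using UnityEngine;

public class ParkingAgent : Agent
{
    public Transform target;          // the parking slot
    public float areaHalfSize = 20f;  // illustrative bound used for normalization
    Rigidbody rBody;

    public override void Initialize()
    {
        rBody = GetComponent<Rigidbody>();
    }

    public override void CollectObservations(VectorSensor sensor)
    {
        // Position, normalized to [0, 1] with respect to the training area bounds
        sensor.AddObservation(Mathf.InverseLerp(-areaHalfSize, areaHalfSize, transform.localPosition.x));
        sensor.AddObservation(Mathf.InverseLerp(-areaHalfSize, areaHalfSize, transform.localPosition.z));

        // Rotation around y, mapped to [0, 1]
        // (in the real code this is transform.localRotation.y, rescaled)
        sensor.AddObservation(transform.localEulerAngles.y / 360f);

        // Speed components, km/h scaled down to roughly [0, 1]
        sensor.AddObservation(Mathf.Abs(rBody.velocity.x * 3.6f / 100f));
        sensor.AddObservation(Mathf.Abs(rBody.velocity.z * 3.6f / 100f));

        // Target position expressed in the car's local frame
        Vector3 toTarget = transform.InverseTransformPoint(target.position);
        sensor.AddObservation(toTarget.x);
        sensor.AddObservation(toTarget.z);

        // Alignment with the slot: 1 = parallel, 0 = perpendicular
        sensor.AddObservation(AngleAlignment01());

        // The two RayPerceptionSensor3D components are attached in the Inspector,
        // so they do not appear here.
    }

    // 1 when the car is parallel to the slot, 0 when perpendicular
    float AngleAlignment01()
    {
        float delta = Vector3.Angle(transform.forward, target.forward);
        if (delta > 90f) delta = 180f - delta;   // treat nose-in and tail-in the same
        return 1f - delta / 90f;
    }
}
```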
Action space (see the sketch after this list):
- throttle, continuous in [-1, 1];
- steering, continuous in [-1, 1];
- brake, continuous in [0, 1].
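The actions are applied in OnActionReceived of the same ParkingAgent class, more or less like this (ActionBuffers comes from Unity.MLAgents.Actuators; ApplyDrive is an illustrative placeholder for the actual car controller):

```csharp
public override void OnActionReceived(ActionBuffers actions)
{
    // Three continuous actions, clamped to the ranges listed above
    float throttle = Mathf.Clamp(actions.ContinuousActions[0], -1f, 1f);
    float steering = Mathf.Clamp(actions.ContinuousActions[1], -1f, 1f);
    float brake    = Mathf.Clamp(actions.ContinuousActions[2],  0f, 1f);

    // Hand the values to the car physics (wheel torques / steer angle).
    // This helper stands in for the actual car controller.
    ApplyDrive(throttle, steering, brake);

    // The step penalty and shaping rewards are added here as well (see the next sketch).
}

void ApplyDrive(float throttle, float steering, float brake)
{
    // Placeholder: set WheelCollider motorTorque / steerAngle / brakeTorque here.
}
```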
Reward system (see the sketch after this list):
- -0.001 for every step; (max step of each agent is set to 3000)
- -0.5 if it collides with sidewalk;
- -1 if it collides with obstacles or other cars;
- sqrt(dx^2 + dz^2)/10, where dx and dz are the differences between the current and previous step’s distance to the target along each axis; this small amount is added only when the agent gets closer to the target;
- 10 points if it stops in the parking slot (distance.x < 0.5 and distance.z < 0.5);
- reward based on the angular difference between the car and the parking slot: 10 points if it’s parallel, 0 if it’s perpendicular;
- curiosity.
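These rewards are added roughly as follows, still inside the same ParkingAgent class (simplified sketch: the prevDistance/prevLocal bookkeeping, the velocity threshold for “stopped” and the collision tags are illustrative):

```csharp
// Fields reset in OnEpisodeBegin
float prevDistance = float.PositiveInfinity;
Vector3 prevLocal;

// Called once per decision, after the actions have been applied
void AddShapingRewards()
{
    // Time penalty so the agent does not stall (MaxStep is 3000)
    AddReward(-0.001f);

    // Progress reward: only when the car gets closer to the slot
    Vector3 local = transform.InverseTransformPoint(target.position);
    float dist = new Vector2(local.x, local.z).magnitude;
    if (dist < prevDistance)
    {
        float dx = Mathf.Abs(local.x) - Mathf.Abs(prevLocal.x);
        float dz = Mathf.Abs(local.z) - Mathf.Abs(prevLocal.z);
        AddReward(Mathf.Sqrt(dx * dx + dz * dz) / 10f);
    }
    prevDistance = dist;
    prevLocal = local;

    // Success: stopped inside the slot (distance thresholds from the list above)
    if (Mathf.Abs(local.x) < 0.5f && Mathf.Abs(local.z) < 0.5f
        && rBody.velocity.magnitude < 0.1f)          // "stopped" threshold is illustrative
    {
        AddReward(10f);                              // parked
        AddReward(10f * AngleAlignment01());         // 10 if parallel, 0 if perpendicular
        EndEpisode();
    }
}

// Collision penalties (tags are illustrative)
void OnCollisionEnter(Collision collision)
{
    if (collision.gameObject.CompareTag("Sidewalk"))
        AddReward(-0.5f);
    else if (collision.gameObject.CompareTag("Obstacle") || collision.gameObject.CompareTag("Car"))
        AddReward(-1f);
}
```

The curiosity reward is not in the agent code; it comes from the intrinsic reward signal in the config below.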
config.yaml:
behaviors:
  ParkingAI:
    trainer_type: ppo
    hyperparameters:
      batch_size: 2048
      buffer_size: 8192
      learning_rate: 0.0003
      beta: 0.01
      epsilon: 0.2
      lambd: 0.99
      num_epoch: 3
      learning_rate_schedule: linear
    network_settings:
      normalize: true
      hidden_units: 256
      num_layers: 3
      memory:
        memory_size: 256
        sequence_length: 512
      vis_encode_type: simple
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
      curiosity:
        gamma: 0.99
        strength: 0.02
        encoding_size: 256
        learning_rate: 0.0003
    keep_checkpoints: 5
    max_steps: 50000000
    time_horizon: 128
    summary_freq: 30000
    threaded: true
Tensorboard graphs:
Please let me know what you think about my configuration: is it right and I just have to wait longer (I’ve seen about 300-400 episodes so far), or do I have to change something? Thank you.