Autonomous parking problem

Hi everyone,
I’m working on a project in which an agent learns to park in a specific spot. The problem is that the agent never enters the spot; it keeps moving forward and backward, stuck in a local optimum.

Here is the environment: (the white slot is the target)


The agent position is generated randomly (in available places) at the beginning of each episode.
I use 20 training areas in the scene, so there are 20 agents learning in parallel.

Here is my configuration:

Observations (collected roughly as in the sketch after this list):

  • transform.localPosition.x; (normalized in [0, 1])
  • transform.localPosition.z; (normalized in [0, 1])
  • transform.localRotation.y; (normalized in [0, 1])
  • Mathf.Abs(rBody.velocity.x * 3.6f / 100);
  • Mathf.Abs(rBody.velocity.z * 3.6f / 100);
  • transform.InverseTransformPoint(target.transform.position).x;
  • transform.InverseTransformPoint(target.transform.position).z;
  • angular difference (y axis) between the car and the parking slot, so that 1 means parallel and 0 means
    perpendicular;
  • RayPerceptionSensor3D with 6 rays per direction, 180°, detecting obstacles and other cars;
  • RayPerceptionSensor3D with 5 rays per direction, 180°, detecting sidewalks and parking slot.
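
Roughly, the vector observations are collected like this (a simplified sketch: the ParkingAgent name, areaSize and the exact normalization constants are placeholders, and the two RayPerceptionSensor3D components live on the agent prefab, not in this code):

using Unity.MLAgents;
using Unity.MLAgents.Sensors;
using UnityEngine;

public partial class ParkingAgent : Agent
{
    public Transform target;        // the target parking slot
    public Rigidbody rBody;
    public float areaSize = 40f;    // placeholder: size of the training area, used for normalization

    public override void CollectObservations(VectorSensor sensor)
    {
        // Position and heading, scaled into roughly [0, 1] (areaSize is a placeholder)
        sensor.AddObservation(transform.localPosition.x / areaSize);
        sensor.AddObservation(transform.localPosition.z / areaSize);
        sensor.AddObservation(transform.localRotation.eulerAngles.y / 360f);

        // Speed components (km/h, scaled down by 100)
        sensor.AddObservation(Mathf.Abs(rBody.velocity.x * 3.6f / 100f));
        sensor.AddObservation(Mathf.Abs(rBody.velocity.z * 3.6f / 100f));

        // Target position expressed in the car's local frame
        Vector3 toTarget = transform.InverseTransformPoint(target.position);
        sensor.AddObservation(toTarget.x);
        sensor.AddObservation(toTarget.z);

        // 1 = parallel to the slot, 0 = perpendicular to it
        float angle = Vector3.Angle(transform.forward, target.forward);
        sensor.AddObservation(Mathf.Abs(Mathf.Cos(angle * Mathf.Deg2Rad)));
    }
}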

Action space (applied roughly as in the sketch after this list):

  • throttle, continuous in [-1, 1];
  • steering, continuous in [-1, 1];
  • brake, continuous in [0, 1].
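
The three continuous actions are read in OnActionReceived roughly like this (again a sketch; the torque/steering limits and the ApplyDrive helper are placeholders for my car controller):

using Unity.MLAgents;
using Unity.MLAgents.Actuators;
using UnityEngine;

public partial class ParkingAgent : Agent
{
    public float maxMotorTorque = 400f;   // placeholder limits for the car controller
    public float maxSteerAngle  = 30f;
    public float maxBrakeTorque = 800f;

    public override void OnActionReceived(ActionBuffers actions)
    {
        // Three continuous actions: throttle [-1, 1], steering [-1, 1], brake [0, 1]
        float throttle = Mathf.Clamp(actions.ContinuousActions[0], -1f, 1f);
        float steering = Mathf.Clamp(actions.ContinuousActions[1], -1f, 1f);
        float brake    = Mathf.Clamp(actions.ContinuousActions[2],  0f, 1f);

        // Placeholder helper: forwards the scaled values to the wheel colliders
        ApplyDrive(throttle * maxMotorTorque, steering * maxSteerAngle, brake * maxBrakeTorque);
    }

    void ApplyDrive(float motor, float steer, float brakeTorque)
    {
        // Car-controller specific; omitted here
    }
}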

Reward system:

  • -0.001 for every step (each agent’s max step is set to 3000);
  • -0.5 if it collides with sidewalk;
  • -1 if it collides with obstacles or other cars;
  • sqrt(dx^2 + dz^2)/10, where dx and dz are the per-axis changes in the distance to the target since the previous step; this small amount is added only when the car gets closer to the target (see the sketch after this list);
  • 10 points if it stops in the parking slot (distance.x < 0.5 and distance.z < 0.5);
  • reward based on angular difference between the car and the parking slot, 10 points if it’s parallel, 0 if it’s perpendicular.
  • curiosity.
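
To make the distance shaping term concrete, it is computed roughly like this each step (a simplified sketch continuing the same class; in the real code dx and dz are tracked per axis):

using Unity.MLAgents;
using UnityEngine;

public partial class ParkingAgent : Agent
{
    float previousDistance;   // planar distance to the slot at the previous step

    void AddDistanceShapingReward()
    {
        // Planar (x/z) distance from the car to the target slot
        Vector3 offset = target.position - transform.position;
        float currentDistance = new Vector2(offset.x, offset.z).magnitude;

        // Small positive reward only when the car got closer since the last step
        // (simplified here to the total planar distance)
        if (currentDistance < previousDistance)
        {
            AddReward((previousDistance - currentDistance) / 10f);
        }
        previousDistance = currentDistance;
    }
}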

config.yaml:

behaviors:
  ParkingAI:
    trainer_type: ppo
    hyperparameters:
      batch_size: 2048
      buffer_size: 8192
      learning_rate: 0.0003
      beta: 0.01
      epsilon: 0.2
      lambd: 0.99
      num_epoch: 3
      learning_rate_schedule: linear
    network_settings:
      normalize: true
      hidden_units: 256
      num_layers: 3
      memory:
        memory_size: 256
        sequence_length: 512
      vis_encode_type: simple
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
      curiosity:
        gamma: 0.99
        strength: 0.02
        encoding_size: 256
        learning_rate: 0.0003
    keep_checkpoints: 5
    max_steps: 50000000
    time_horizon: 128
    summary_freq: 30000
    threaded: true

TensorBoard graphs:


Please let me know what you think about my configuration: is it right and do I just have to wait longer (I’ve seen about 300–400 episodes so far), or do I have to change something? Thank you.

Rewarding distance changes can result in oscillating behaviour if the agent figures out that it can maximize rewards by repeatedly moving forward and backward. You could try rewarding the agent for getting closer to the target, but also assigning a proportional penalty for moving away from it.
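
In code that could look something like this (a rough snippet; previousDistance/currentDistance are the distances you already track for your shaping reward, and k is a small scale factor you’d have to tune):

// Positive when the car moved toward the slot this step,
// an equally-sized penalty when it moved away
AddReward(k * (previousDistance - currentDistance));
previousDistance = currentDistance;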

Thanks for your quick reply.
Yes, before using curiosity I had also tried adding a negative reward for moving away, but in that case the agent wouldn’t move at all. Also, I’ve read in another post here that adding a negative reward for actions I don’t want could end up discouraging the agent from moving, because it learns that whenever it moves it has roughly a 50% chance of getting a positive reward and a 50% chance of getting a negative one.
However, since I’m also using curiosity I’ll try that, thanks.
Is there anything else I can change to improve my configuration?

You could try keeping track of the “best” (smallest) distance so far, and only give the reward when that decreases. So something like this (sketched in C#):

if (currentDistance < bestDistance)
{
    // Only reward progress past the closest point reached so far
    AddReward(k * (bestDistance - currentDistance));
    bestDistance = currentDistance;
}
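
One thing I’d double-check (an assumption about your setup): bestDistance needs to be reset at the start of every episode, otherwise a good run makes the reward almost unreachable in the next one. Something like:

public override void OnEpisodeBegin()
{
    // Start from the spawn distance so only genuine improvements are rewarded
    bestDistance = Vector3.Distance(transform.position, target.position);
}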

This is a great idea, thank you.
I’m applying a lot of changes; I’ll let you know the results in the next few days.
How many steps should I wait, for each run, before I can actually tell whether it’s working?