Problem with training a self-driving car

Hi All!

I have the following environment:

  • Self-driving car (gas, brake, steering)
  • Path of waypoints
  • Static obstacles on path

The initial training goal was to drive around a looped path at maximum speed. 2-3M steps led to more or less realistic behaviour.

The current goal is to avoid obstacles.

Observations

  • gas/brake
  • steering angle
  • parameters of several points ahead on the path to predict turns (curvature)
  • several raycasts against static obstacles

Total: vector of 23 floats
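
For illustration, the observation collection looks roughly like this (a simplified sketch; the exact split between the angle and raycast counts differs a bit in my real code):

using Unity.MLAgents;
using Unity.MLAgents.Sensors;

public class VehicleObservationsSketch : Agent
{
    // Illustrative fields only; in the real agent they are filled from input, the path and raycasts.
    float gas, brake, steeringAngle;
    readonly float[] aheadAngles = new float[5];    // turn/curvature info for points ahead on the path
    readonly float[] rayDistances = new float[15];  // normalized raycast distances to static obstacles

    public override void CollectObservations(VectorSensor sensor)
    {
        sensor.AddObservation(gas);              // 1
        sensor.AddObservation(brake);            // 1
        sensor.AddObservation(steeringAngle);    // 1
        foreach (float a in aheadAngles)  sensor.AddObservation(a);   // 5
        foreach (float d in rayDistances) sensor.AddObservation(d);   // 15
        // Total: 23 floats
    }
}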

Actions (Continuous)

  • gas/brake
  • steering

Total: 2

Reward system
To calculate the reward I use the following parameters:

  • directionError ∈ [-1; 1], mapped from the unsigned angle [180°; 0°] between the car's "forward" and the path direction
  • gasBrakeBonus = speedIsLow ? gas - brake : 0f (the real version is a bit more complex to make it smooth)
  • outOfPathPenalty = distanceFromPath > 2 ? (distanceFromPath - 2) / maxDistanceFromPath : 0f (also a bit more complex to make it smooth)

Rewards and penalties are split as follows (a code sketch follows this list):

  • Reward on each step = speed * directionError + gasBrakeBonus - outOfPathPenalty (divided by a scale factor to keep per-step rewards small, so that critical failures and finishing remain more meaningful)
  • Penalty for low speed during N seconds = -1, end of episode
  • Penalty for a critically wrong direction (>90 deg) = -1, end of episode
  • Penalty for critically leaving the path (when distanceFromPath > maxDistanceFromPath) = -1, end of episode
  • Penalty for a collision with an obstacle = -1, end of episode
  • Checkpoint reward (in between obstacles) = 0.1f
  • Reward for finishing the path = 1f, end of episode
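
In ML-Agents terms the whole scheme looks roughly like this (smoothing omitted; all fields and constants below are placeholders rather than my real values):

using Unity.MLAgents;
using Unity.MLAgents.Actuators;

public class VehicleRewardSketch : Agent
{
    // Placeholder state; in the real agent these come from physics, the path and the raycasts.
    float speed, gas, brake, angleToPath, distanceFromPath, lowSpeedTimer;
    bool collidedWithObstacle, passedCheckpoint, reachedFinish;

    const float maxDistanceFromPath = 5f;   // placeholder values
    const float maxLowSpeedSeconds = 5f;
    const float minSpeed = 1f;
    const float stepRewardScale = 100f;

    public override void OnActionReceived(ActionBuffers actions)
    {
        // ... apply gas/brake/steering from actions and update the fields above ...

        // Per-step shaping (the real code smooths these terms).
        float directionError = 1f - 2f * (angleToPath / 180f);   // [180; 0] deg -> [-1; 1]
        float gasBrakeBonus = speed < minSpeed ? gas - brake : 0f;
        float outOfPathPenalty = distanceFromPath > 2f
            ? (distanceFromPath - 2f) / maxDistanceFromPath
            : 0f;
        AddReward((speed * directionError + gasBrakeBonus - outOfPathPenalty) / stepRewardScale);

        // Terminal and sparse events.
        if (lowSpeedTimer > maxLowSpeedSeconds)          { SetReward(-1f); EndEpisode(); }
        else if (angleToPath > 90f)                      { SetReward(-1f); EndEpisode(); }
        else if (distanceFromPath > maxDistanceFromPath) { SetReward(-1f); EndEpisode(); }
        else if (collidedWithObstacle)                   { SetReward(-1f); EndEpisode(); }
        else if (reachedFinish)                          { SetReward(1f);  EndEpisode(); }
        else if (passedCheckpoint)                       { AddReward(0.1f); }
    }
}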

The rewards are aligned with the following assumptions:

  1. Passing the track fast, without missed turns and without collisions with obstacles (the goal): the reward must be maximal (2.3 over 310 steps)
  2. Passing the track slowly all the way to the finish (2.3 over 630 steps)
  3. Driving fast and going off the path (or colliding) (-0.01 over 340 steps)
  4. Driving slowly and going off the path (or colliding) (-0.69 over 301 steps, -0.04 over 431 steps)
  5. Standing still on the start line (-0.99 over 252 steps)

So it seems the reward is shaped well towards main goal #1, but training does not converge.

Could someone please point out what I am doing wrong, or provide any suggestions?

I have tried different configs and different reward systems, from totally sparse to much more supervised, but 5-8M steps do not lead to the goal: the car collides with the first obstacle.
One note about training a simpler case, just driving forward at minimal speed: it takes about 100-300k steps to learn. It can be faster with GAIL (commented out in the config below), but GAIL did not bring success in avoiding obstacles.

behaviors:
  Vehicle:
    trainer_type: sac
    hyperparameters:
      learning_rate: 0.0003
      learning_rate_schedule: constant
      batch_size: 128 #128-1024 typical for SAC with continuous actions (thousands for continuous-action PPO)
      buffer_size: 100000 #50k-1m for SAC, should be thousands of times larger than the avg episode length
      buffer_init_steps: 0 #1k-10k to help prefill exp buffer with random actions
      tau: 0.005
      steps_per_update: 10.0 #lower value -> more sample-efficient but more CPU time; rule of thumb: set equal to the number of agents
      save_replay_buffer: false #dump experience buffer prior to exit training and load on resume
      init_entcoef: 0.5 #0.5 - 1.0 for continuous actions; higher value -> more exploration at the beginning -> faster training
      reward_signal_steps_per_update: 10.0 #in general, set equal to steps_per_update
    network_settings:
      normalize: false
      hidden_units: 128
      num_layers: 2
      vis_encode_type: simple
#      memory:
#        sequence_length: 32
#        memory_size: 128
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
#      gail:
#        strength: 0.1
#        gamma: 0.99
#        encoding_size: 128
#        demo_path: d:/UnityProjects/RaceExperiment/Assets/MLDemos/
#        learning_rate: 0.0003
#        use_actions: false
#        use_vail: false
    keep_checkpoints: 5
    max_steps: 12000000
    time_horizon: 128 #low value -> less variance, more bias; can be better for high-frequency rewards
    summary_freq: 3000
    threaded: true
#    behavioral_cloning:
#      demo_path: d:/UnityProjects/RaceExperiment/Assets/MLDemos/DemoPath104.demo
#      steps: 0
#      strength: 1.0
#      samples_per_update: 0

(Attachment: learning-8m-steps-avoid-obstacles.png)

The observations look ok to me. Waypoints are in the agent’s local space, right?
You have quite a few rewards and penalties though. Each one introduces a potential risk of the agent exploiting some design flaw you might have overlooked. Many rewards also make the cumulative reward graph harder to interpret. I think you can simplify/combine most of the rewards you’ve listed by just setting a single one proportional to the dot product of the car’s velocity and the path direction. Maybe try constraining the car’s range of motion by placing barriers at the sides of the road and treat them like obstacles, so they can be detected by the raycasts.
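
Something along these lines, as an untested sketch (how you get the path direction is up to your path implementation):

using Unity.MLAgents;
using Unity.MLAgents.Actuators;
using UnityEngine;

public class SingleRewardAgentSketch : Agent
{
    public Rigidbody body;
    public float rewardScale = 0.01f;   // keep per-step rewards small

    public override void OnActionReceived(ActionBuffers actions)
    {
        // ... apply gas/brake/steering from actions ...

        // One dense reward: the car's velocity projected onto the path direction.
        // Negative when driving backwards, larger when driving fast along the path.
        Vector3 pathDirection = GetPathDirection();
        AddReward(Vector3.Dot(body.velocity, pathDirection) * rewardScale);
    }

    Vector3 GetPathDirection()
    {
        // Placeholder: should return the normalized tangent of the path segment nearest to the car.
        return transform.forward;
    }
}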

Thank you for the answer!
The waypoints are passed as observations simply as consecutive angles (floats) between the car's direction and the directions to the points ahead.
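
Per look-ahead point it is roughly this (a simplified sketch; the helper is just for illustration):

using UnityEngine;

public static class PathObservationsSketch
{
    // Signed angle, normalized to [-1; 1], between the car's forward direction and the
    // direction from the car to a point ahead on the path; one value per look-ahead point.
    public static float AngleToAheadPoint(Transform car, Vector3 pathPoint)
    {
        Vector3 toPoint = (pathPoint - car.position).normalized;
        return Vector3.SignedAngle(car.forward, toPoint, Vector3.up) / 180f;
    }
}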

What do you mean by "quite a few rewards and penalties"? Is it about "Reward on each step = speed * directionError + gasBrakeBonus - outOfPathPenalty", or about the overall set of rewards?
All other rewards except the checkpoint reward lead to the end of the episode (the penalties mark the trajectory as a negative experience to be avoided).
I use "open" roads without barriers, which is why I have to force the agent to move along the path with rewards and penalties based on the distance from the path. There is an observation, plus a small smooth penalty as the distance grows, plus a large total penalty (-1) for completely leaving the path. So raycasting handles the obstacles and the path-based reward handles the track; they seem to be completely orthogonal goals with separate observations for the track and for the obstacles.

Is there any way to learn independent experiences separately and "union" them together later?
I'm afraid of NN degradation if I start by training on "test" tracks (gas/brake/steering with angle prediction) and then continue with an easy track that has obstacles.

I think @mbaske is giving good advice here. Your reward function is complicated, though it may correctly encourage the behavior you expect with some tuning. My advice would be to simplify the reward function to ensure that you can train properly in a setting that's more human-understandable, and then begin to shape it as you desire changes in behavior. You may find that for some behaviors, no shaping is required!

I simplified the step reward function to distance * scaleFactor, and it took about 400k steps for the car to learn to at least move forward a bit.
I also tried the following:

  • Changed the observations to a "vector" form: fewer raycasts + 5 relative points + 5 relative directions ahead on the path (http://auro.ai/blog/2020/05/learning-to-drive/)
  • Changed the observations back to the "distance" from the path + 5 relative directions (currently 28 float observations)
  • Changed SetReward to AddReward, with step reward = (distance + gasBrakeLowSpeed) * scaleFactor, where gasBrakeLowSpeed = speed < minSpeed ? (isBraking ? -1 : isGas ? 1 : 0) : 0 (a linear version in practice, of course; see the sketch after this list)
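
The linear low-speed term is roughly the following (simplified names; gas/brake ranges are my assumption):

using UnityEngine;

public static class RewardTermsSketch
{
    // Linear version of the low-speed gas/brake term: encourages gas and discourages braking
    // while the car is slower than minSpeed, fading to 0 as the speed approaches minSpeed.
    // gas and brake are the continuous actions, assumed to be in [0; 1].
    public static float GasBrakeLowSpeed(float speed, float minSpeed, float gas, float brake)
    {
        float lowSpeedFactor = Mathf.Clamp01(1f - speed / minSpeed);  // 1 at standstill, 0 at minSpeed
        return lowSpeedFactor * (gas - brake);
    }
}

The step reward is then AddReward((distance + GasBrakeLowSpeed(...)) * scaleFactor).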

In general there is no convergence: the car stands on the start line, drives off the path (multiple times in the same directions), or crashes into the first obstacle (0.5-1.5M steps). It is very likely stuck in a local extremum, but how do I find it?

And again, I added all of the "critical" penalties:

  • low speed for 10 seconds = -1, EndEpisode
  • out of path: -0.1f, EndEpisode
  • wrong direction: -0.2f, EndEpisode
  • crash: -0.2f, EndEpisode

I don't know how I can simplify the reward any further.

On the other hand:

  • GAIL is required (all demos together contain about 50-70 episodes, successful or not); the car can't even start without it (at least with the SAC config above)
  • The GAIL strength and learning_rate must be lower for the GAIL loss to converge
  • The decision period must be tuned; I had it at 1 and it took too much time to move forward even a bit
  • A smaller buffer_size makes the car more lively instead of repeating the same wrong actions

PS. One run found a bug in the code :slight_smile: On each episode reset the car was placed at the "0" point of the track, but because of real physics it ended up with a +/- delta. For the "-" cases the reward was larger, and training converged to the best strategy: standing on the start line :)))))