The initial goal of training was to drive through a looped path at maximum speed. 2-3M steps led to more or less realistic behaviour.
The current goal is to avoid obstacles.
Observations
gas/brake
steering angle
several look-ahead point parameters to predict turns (curvature)
several raycasts detecting static obstacles
Total: vector of 23 floats
Actions (Continuous)
gas/brake
steering
Total: 2
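To make the setup concrete, here is a simplified sketch of how this observation/action layout is wired into an ML-Agents Agent subclass (the field names and the normalization are illustrative placeholders, not the exact project code):

using Unity.MLAgents;
using Unity.MLAgents.Actuators;
using Unity.MLAgents.Sensors;
using UnityEngine;

public class VehicleAgent : Agent
{
    // Placeholder fields standing in for the real car/path code.
    [SerializeField] float currentGasBrake;     // last applied gas/brake in [-1, 1]
    [SerializeField] float currentSteering;     // last applied steering in [-1, 1]
    [SerializeField] float[] lookAheadAngles;   // signed angles (deg) to several ahead points
    [SerializeField] float[] rayHitFractions;   // normalized raycast distances to static obstacles

    public override void CollectObservations(VectorSensor sensor)
    {
        sensor.AddObservation(currentGasBrake);
        sensor.AddObservation(currentSteering);
        foreach (float angle in lookAheadAngles)
            sensor.AddObservation(angle / 180f);   // scale angles into [-1, 1]
        foreach (float hit in rayHitFractions)
            sensor.AddObservation(hit);            // already in [0, 1]
        // The total must match the behaviour's vector observation size (23 floats here).
    }

    public override void OnActionReceived(ActionBuffers actions)
    {
        // 2 continuous actions: gas/brake and steering.
        float gasBrake = Mathf.Clamp(actions.ContinuousActions[0], -1f, 1f);
        float steering = Mathf.Clamp(actions.ContinuousActions[1], -1f, 1f);
        currentGasBrake = gasBrake;   // the real code would feed these into the vehicle physics
        currentSteering = steering;
    }
}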
Reward system
To calculate the reward I used the following parameters (simplified versions are sketched in code right after this list):
directionError: the unsigned angle between the car’s “forward” and the path direction, mapped from [180°; 0°] to [-1; 1]
gasBrakeBonus = speedIsLow ? gas - brake : 0f (a bit more complex to make it smooth)
outOfPathPenalty = distanceFromPath > 2 ? ((distanceFromPath - 2) / maxDistanceFromPath) : 0f (a bit more complex too, to make it smooth)
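In code the three terms look roughly like this (simplified: the real versions are smoothed a bit, and directionError is shown as a plain linear [180°; 0°] -> [-1; 1] mapping):

// Simplified shaping terms; the real ones are smoothed a bit more.
static class RewardTerms
{
    // Unsigned angle between the car's forward vector and the path direction: 0 deg -> +1, 180 deg -> -1.
    public static float DirectionError(float unsignedAngleDeg)
        => 1f - 2f * (unsignedAngleDeg / 180f);

    // Encourage pressing gas (and not brake) while the speed is low.
    public static float GasBrakeBonus(bool speedIsLow, float gas, float brake)
        => speedIsLow ? gas - brake : 0f;

    // Grows once the car is more than 2 m away from the path.
    public static float OutOfPathPenalty(float distanceFromPath, float maxDistanceFromPath)
        => distanceFromPath > 2f
            ? (distanceFromPath - 2f) / maxDistanceFromPath
            : 0f;
}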
Rewards/penalties are split as follows:
Reward on each step = speed * directionError + gasBrakeBonus - outOfPathPenalty (divided by some factor to keep per-step rewards small and make critical failures or the finish more meaningful)
Penalty for low speed lasting N seconds = -1, end of episode
Penalty for a critically wrong direction (>90 deg) = -1, end of episode
Penalty for going critically far off the path = -1, end of episode (when distanceFromPath > maxDistanceFromPath)
Penalty for colliding with an obstacle = -1, end of episode
Checkpoint reward (in between obstacles) = 0.1f
Reward for finishing the path = 1f, end of episode
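Wired into the agent, the whole thing looks roughly like this (simplified sketch: stepRewardScale is a made-up name and value for the “some factor” above, and the methods are shown in a separate class only to keep the snippet short):

using Unity.MLAgents;

public class VehicleRewardSketch : Agent
{
    const float stepRewardScale = 0.003f;   // hypothetical value for the "some factor"

    // Called on every step with the shaping terms described above.
    public void ApplyStepReward(float speed, float directionError,
                                float gasBrakeBonus, float outOfPathPenalty)
    {
        float stepReward = speed * directionError + gasBrakeBonus - outOfPathPenalty;
        SetReward(stepReward * stepRewardScale);   // later changed to AddReward, see the follow-up below
    }

    // Terminal / event rewards.
    public void OnLowSpeedTimeout()    { SetReward(-1f); EndEpisode(); }   // low speed for N seconds
    public void OnWrongDirection()     { SetReward(-1f); EndEpisode(); }   // angle to path > 90 deg
    public void OnCriticalOutOfPath()  { SetReward(-1f); EndEpisode(); }   // distanceFromPath > maxDistanceFromPath
    public void OnObstacleCollision()  { SetReward(-1f); EndEpisode(); }
    public void OnCheckpointPassed()   { AddReward(0.1f); }                // checkpoints between obstacles
    public void OnFinishReached()      { SetReward(1f);  EndEpisode(); }
}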
The rewards were tuned with the following scenarios in mind:
Passing the track fast, with no missed turns and no collisions with obstacles, is the goal, so its reward must be the maximum (2.3 for 310 steps)
Passing the track slowly to the finish (2.3 for 630 steps)
Driving fast and going off the path (or colliding) (-0.01 for 340 steps)
Driving slowly and going off the path (or colliding) (-0.69 for 301 steps, -0.04 for 431 steps)
Standing still on the start line (-0.99 for 252 steps)
So it seems the reward is shaped well towards main goal #1, but training does not converge.
Could someone please point out what I am doing wrong, or provide any suggestions?
I tried different configs and different reward systems, from totally sparse to much more supervised, but 5-8M steps do not lead to the goal; the car collides with the 1st obstacle.
One note about training a simpler case - just driving forward at minimal speed. It takes about 100-300k steps to learn, and it can be faster with GAIL (commented out below), but GAIL does not bring success in avoiding obstacles.
behaviors:
  Vehicle:
    trainer_type: sac
    hyperparameters:
      learning_rate: 0.0003
      learning_rate_schedule: constant
      batch_size: 128 # thousands for continuous actions in general; 128-1024 for SAC
      buffer_size: 100000 # 50k-1M for SAC; should be thousands of times larger than the average episode length
      buffer_init_steps: 0 # 1k-10k helps prefill the experience buffer with random actions
      tau: 0.005
      steps_per_update: 10.0 # rule of thumb: = number of agents; lower values are more sample-efficient but cost more CPU
      save_replay_buffer: false # save the experience buffer when training exits and reload it on resume
      init_entcoef: 0.5 # 0.5-1.0 for continuous; higher value -> more exploration at the beginning -> faster training
      reward_signal_steps_per_update: 10.0 # generally equal to steps_per_update
    network_settings:
      normalize: false
      hidden_units: 128
      num_layers: 2
      vis_encode_type: simple
      # memory:
      #   sequence_length: 32
      #   memory_size: 128
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
      # gail:
      #   strength: 0.1
      #   gamma: 0.99
      #   encoding_size: 128
      #   demo_path: d:/UnityProjects/RaceExperiment/Assets/MLDemos/
      #   learning_rate: 0.0003
      #   use_actions: false
      #   use_vail: false
    keep_checkpoints: 5
    max_steps: 12000000
    time_horizon: 128 # lower value -> less variance but more bias; can work better for frequent rewards
    summary_freq: 3000
    threaded: true
    # behavioral_cloning:
    #   demo_path: d:/UnityProjects/RaceExperiment/Assets/MLDemos/DemoPath104.demo
    #   steps: 0
    #   strength: 1.0
    #   samples_per_update: 0
The observations look ok to me. Waypoints are in the agent’s local space, right?
You have quite a few rewards and penalties though. Each one introduces a potential risk of the agent exploiting some design flaw you might have overlooked. Many rewards also make the cumulative reward graph harder to interpret. I think you can simplify/combine most of the rewards you’ve listed by just setting a single one proportional to the dot product of the car’s velocity and the path direction. Maybe try constraining the car’s range of motion by placing barriers at the sides of the road and treat them like obstacles, so they can be detected by the raycasts.
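For example, something along these lines (untested sketch; the field names are just placeholders):

using Unity.MLAgents;
using UnityEngine;

// Single dense reward: proportional to the dot product of the car's velocity
// and the (normalized) path direction, so it rewards driving fast along the path.
public class VelocityAlongPathReward : Agent
{
    [SerializeField] Rigidbody carBody;
    [SerializeField] float rewardScale = 0.001f;   // keep per-step rewards small

    // Call this once per step with the current path direction at the car's position.
    public void AddPathVelocityReward(Vector3 pathDirection)
    {
        float alongPath = Vector3.Dot(carBody.velocity, pathDirection.normalized);
        AddReward(alongPath * rewardScale);
    }
}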
Thank you for the answer!
Waypoints are passed to the observations just as consecutive angles (floats) between the car’s direction and the directions to the ahead points.
What do you mean by “quite a few rewards and penalties”? Is it about “Reward on each step = speed * directionError + gasBrakeBonus - outOfPathPenalty” or about the overall set of rewards?
All of the event rewards/penalties except the “checkpoint” one lead to the end of the episode (a marker for learning, cutting the episode short on the way to negative experience).
I used “open” roads, without barriers, and that’s why I need to force the agent to move along the path with rewards and penalties based on the distance from the path. There is an observation + a small smooth penalty as the distance increases + a large total penalty (-1) for completely-out-of-path cases. That is why I’m using raycasting for the obstacles and the “path” reward for the track. They seem like completely orthogonal goals, with separate observations for the track and the obstacles.
Is there any way to learn independent experiences separately and “union” them together later?
I’m afraid of NN degradation if I start by learning on “test” tracks (gas/brake/steering with angle prediction) and then continue with an easy track that has obstacles.
I think @mbaske is giving good advice here. Your reward function is complicated, though it may correctly encourage the behavior you expect with some tuning. My advice would be to simplify the reward function to ensure that you can train properly in a setting that’s more human-understandable, and then begin to shape it as you desire changes in behavior. You may find that for some behaviors, no shaping is required!
Changed observations back to “distance” from path + 5 relative directions. (Currently 28 float observations).
Changed SetReward to AddReward, and the step reward is now (distance + gasBrakeLowSpeed) * scaleFactor, where gasBrakeLowSpeed = speed < minSpeed ? (isBraking ? -1 : isGas ? 1 : 0) : 0 (a linear version, of course; a compact code sketch follows the penalty list below).
In general it does not converge: the car stands on the start line, drives off the path (multiple times in the same directions), or crashes into the 1st obstacle (0.5-1.5M steps). It very likely gets stuck in a local extremum, but how do I find it?
And again, I added all of the “critical” penalties:
low speed for 10 seconds = -1, EndEpisode
out of path: -0.1f, EndEpisode
wrong direction: -0.2f, EndEpisode
crash: -0.2f, EndEpisode
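For clarity, the updated reward in code form (simplified sketch; the minSpeed and scaleFactor values are placeholders, and the real gasBrakeLowSpeed term is the smooth/linear variant):

using Unity.MLAgents;

public class UpdatedRewardSketch : Agent
{
    const float minSpeed = 1f;          // placeholder value
    const float scaleFactor = 0.003f;   // placeholder value

    // Per-step reward: distance term plus the low-speed gas/brake term.
    public void AddStepReward(float distance, float speed, bool isGas, bool isBraking)
    {
        float gasBrakeLowSpeed = speed < minSpeed
            ? (isBraking ? -1f : (isGas ? 1f : 0f))
            : 0f;
        AddReward((distance + gasBrakeLowSpeed) * scaleFactor);
    }

    // Critical penalties, all ending the episode.
    public void OnLowSpeedFor10Seconds() { AddReward(-1f);   EndEpisode(); }
    public void OnOutOfPath()            { AddReward(-0.1f); EndEpisode(); }
    public void OnWrongDirection()       { AddReward(-0.2f); EndEpisode(); }
    public void OnCrash()                { AddReward(-0.2f); EndEpisode(); }
}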
I don’t know how I can simplify the reward any further.
On the other hand:
GAIL is required (the demos contain about 50-70 episodes, successful or not); the car can’t even start without it (at least with the SAC config above)
gail strength and learning_rate must be lower in order for the GAIL loss to converge
The decision period must be tuned; I had “1” and it took too much time to move even a bit forward (a small sanity-check snippet follows this list)
A smaller buffer_size makes the car more alive instead of repeating the same wrong actions
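On the decision-period point, a tiny check that just logs what the standard ML-Agents DecisionRequester component is configured to (the values themselves are set in the Inspector):

using Unity.MLAgents;
using UnityEngine;

// Logs the decision settings at startup while tuning; with DecisionPeriod = N the
// policy is queried every N steps, and TakeActionsBetweenDecisions repeats the
// last action in between.
public class DecisionPeriodCheck : MonoBehaviour
{
    void Start()
    {
        var requester = GetComponent<DecisionRequester>();
        if (requester != null)
        {
            Debug.Log($"DecisionPeriod={requester.DecisionPeriod}, " +
                      $"TakeActionsBetweenDecisions={requester.TakeActionsBetweenDecisions}");
        }
    }
}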
P.S. One run revealed a bug in the code. On each episode reset the car was placed at the “0” point of the track, but real physics introduced a +/- delta. For the “-” cases the reward was larger, and training converged to the best strategy: standing on the start line :)))))
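In case it helps anyone, a reset that snaps the car exactly onto the start point and zeroes its velocities looks roughly like this (a sketch, not necessarily the exact fix I applied):

using Unity.MLAgents;
using UnityEngine;

// Reset the car exactly onto the start transform with zero velocity, so physics
// cannot leave it slightly behind the "0" point on the first step of an episode.
public class ResetSketch : Agent
{
    [SerializeField] Transform startPoint;   // the "0" point of the track
    [SerializeField] Rigidbody carBody;

    public override void OnEpisodeBegin()
    {
        carBody.velocity = Vector3.zero;
        carBody.angularVelocity = Vector3.zero;
        carBody.position = startPoint.position;
        carBody.rotation = startPoint.rotation;
    }
}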