basic tips to increase training stability

Which parameters of the neural network are responsible for the stability of the agent's behavior?

My training (not stable):

[INFO] Agent. Step: 3000000. Time Elapsed: 6701.190 s. Mean Reward: 449.745. Std of Reward: 464.177. Training.
[INFO] Agent. Step: 3030000. Time Elapsed: 6767.475 s. Mean Reward: 281.317. Std of Reward: 303.813. Training.
[INFO] Agent. Step: 3060000. Time Elapsed: 6825.893 s. Mean Reward: 1024.422. Std of Reward: 1616.215. Training.
[INFO] Agent. Step: 3090000. Time Elapsed: 6891.545 s. Mean Reward: 333.737. Std of Reward: 343.476. Training.
[INFO] Agent. Step: 3120000. Time Elapsed: 6961.993 s. Mean Reward: 529.770. Std of Reward: 438.336. Training.
[INFO] Agent. Step: 3150000. Time Elapsed: 7028.978 s. Mean Reward: 386.342. Std of Reward: 240.528. Training.
[INFO] Agent. Step: 3180000. Time Elapsed: 7089.501 s. Mean Reward: 1242.240. Std of Reward: 1191.351. Training.
[INFO] Agent. Step: 3210000. Time Elapsed: 7162.898 s. Mean Reward: 471.763. Std of Reward: 76.120. Training.
[INFO] Agent. Step: 3240000. Time Elapsed: 7225.747 s. Mean Reward: 392.818. Std of Reward: 510.116. Training.

and my network params:

behaviors:
  Agent:
    trainer_type: ppo
    hyperparameters:
      batch_size: 2048
      buffer_size: 20480
      learning_rate: 0.0003
      beta: 0.005
      epsilon: 0.2
      lambd: 0.95
      num_epoch: 3
      learning_rate_schedule: linear
    network_settings:
      normalize: true
      hidden_units: 512
      num_layers: 3
      vis_encode_type: simple
    reward_signals:
      extrinsic:
        gamma: 0.995
        strength: 1.0
    keep_checkpoints: 30
    checkpoint_interval: 1000000
    max_steps: 50000000
    time_horizon: 1000
    summary_freq: 30000

Thanks!

It totally depends on what task you want to perform. In my case I reduced the size of the neural network and the training stabilized. My neural network has hidden_units=128 and num_layers=2; with num_layers=3 the training was sometimes extremely slow and unstable.
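For reference, a network_settings block with that smaller size would look roughly like this (same format as your config above; normalize and vis_encode_type are just carried over from your settings, and these sizes are what worked for my task, not values tuned for yours):

network_settings:
  normalize: true
  hidden_units: 128
  num_layers: 2
  vis_encode_type: simple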
How you give the reward also matters a lot. You may need to adjust the way you hand out rewards to make learning easier.

Your mean reward is also really high. Are you rewarding the agent a lot? Are the rewards large in magnitude (reward > 1)?
Do you apply a small decay on each step? Are you punishing wrong actions? Is your environment fixed or random? Why is it called Agent everywhere and SimpleWalker once?

Thanks!
I'm doing areas like Walker/Crawler (locomotion), so the farther the agent gets, the higher the reward.
My character walks reasonably well now, but it still fails sometimes.
I will try playing with the hidden layer size and the other suggestions.

[quote=“Zibelas”, post:3, topic: 885763]
Your mean reward is also really high. Are you rewarding the agent a lot? Are the rewards large in magnitude (reward > 1)?
Do you apply a small decay on each step? Are you punishing wrong actions? Is your environment fixed or random? Why is it called Agent everywhere and SimpleWalker once?
[/quote]

No, my reward is designed so it can be at most 1 at each time step. As I said, with my task there is no limit - just infinite walking. The SimpleWalker name is just a copy/paste error.
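To make that concrete, here is a minimal sketch (in Python, not the actual Unity C# agent code) of the kind of per-step reward being described: forward progress scaled into [0, 1] per step, plus an optional small per-step decay. The names and the scaling constant are assumptions for illustration, not values from this thread.

# Minimal sketch of a per-step locomotion reward, capped at 1.0 per time step.
# MAX_STEP_PROGRESS is a hypothetical normalization constant that depends on
# your environment; TIME_PENALTY is the optional small per-step decay.
MAX_STEP_PROGRESS = 0.05   # assumed upper bound on per-step forward progress
TIME_PENALTY = 0.0005      # small decay each step to discourage standing still

def step_reward(forward_progress: float) -> float:
    """Reward for a single time step, never larger than 1.0."""
    # Scale progress into [0, 1] and clamp so one step can never exceed 1.0.
    progress = min(max(forward_progress / MAX_STEP_PROGRESS, 0.0), 1.0)
    return progress - TIME_PENALTY

In an ML-Agents setup this would be the value you add as the reward on each decision step.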