Curriculum learning not being triggered?

Hi

Is it possible to mix curriculum learning with imitation learning? Maybe I’m configuring things wrong, but when I append my curriculum learning config to my imitation learning config, the curriculum part never seems to get triggered.

This is the config I’m using. Any hints or help would be appreciated.

behaviors:
  PongBehavior:
    trainer_type: ppo
    hyperparameters:
      batch_size: 256
      buffer_size: 10240
      learning_rate: 0.0003
      beta: 0.005
      epsilon: 0.2
      lambd: 0.95
      num_epoch: 3
      learning_rate_schedule: linear
    network_settings:
      normalize: true
      hidden_units: 128
      num_layers: 3
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
      gail:
        gamma: 0.99
        strength: 0.5
        encoding_size: 128
        learning_rate: 0.0003
        demo_path: /home/Documents/ML-agents/ml-agents/CarPong/Assets/Demonstrations/PongBehavior.demo
    behavioral_cloning:
      strength: 0.9
      demo_path: /home/Documents/ML-agents/ml-agents/CarPong/Assets/Demonstrations/PongBehavior.demo
    keep_checkpoints: 5
    max_steps: 1000000
    time_horizon: 64
    summary_freq: 10000
    threaded: true

environment_parameters:
  drone_targets_widths:
    curriculum:
      - name: Lesson0
        completion_criteria:
          measure: progress
          behavior: PongBehavior
          signal_smoothing: true
          min_lesson_length: 10000
          threshold: 0.1
        value:
          sampler_type: uniform
          sampler_parameters:
            min_value: 0.0
            max_value: 0.0
      - name: Lesson1
        completion_criteria:
          measure: progress
          behavior: PongBehavior
          signal_smoothing: true
          min_lesson_length: 10000
          threshold: 0.3
        value:
          sampler_type: uniform
          sampler_parameters:
            min_value: 0.0
            max_value: 5.0
      - name: Lesson2
        completion_criteria:
          measure: progress
          behavior: PongBehavior
          signal_smoothing: true
          min_lesson_length: 10000
          threshold: 0.65
        value:
          sampler_type: uniform
          sampler_parameters:
            min_value: 0.0
            max_value: 15.0
      - name: Lesson3
        completion_criteria:
          measure: progress
          behavior: PongBehavior
          signal_smoothing: true
          min_lesson_length: 10000
          threshold: 0.8
        value:
          sampler_type: uniform
          sampler_parameters:
            min_value: 0.0
            max_value: 25.0
      - name: Lesson4
        completion_criteria:
          measure: reward
          behavior: PongBehavior
          signal_smoothing: true
          min_lesson_length: 10000
          threshold: 200
        value:
          sampler_type: uniform
          sampler_parameters:
            min_value: 0.0
            max_value: 50.0

Thanks!

Hi @kt66nfkim,
This sounds reasonable to me. I’m reaching out to some folks on the research team to find out.

When you say the curriculum part never gets triggered, do you mean that you never leave Lesson0 even after your steps go over 100k? I see that your max_steps is set to 1,000,000 and your threshold for Lesson0 is 0.1, so you’d need to wait for 100,000 steps to move to the next lesson. Is that correct?
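
For reference, this is my reading of how the progress measure works, annotated with the numbers from your config (progress being the behavior’s current step count divided by max_steps, and min_lesson_length being the number of episodes that must finish before a lesson can change):

# Lesson0 from your config, annotated
completion_criteria:
  measure: progress          # progress = current steps / max_steps
  behavior: PongBehavior
  signal_smoothing: true
  min_lesson_length: 10000   # episodes that must complete before the lesson can change
  threshold: 0.1             # 0.1 * max_steps (1,000,000) = 100,000 steps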

Hi @christophergoy

Yes, that’s exactly what I’m seeing. Even after 100k steps the environment never moves on to the next lesson. Here’s a small sample of the output from one training run using the config file I shared earlier. You can see that even after 200k steps the environment stays in the same lesson.

I also double-checked by using a similar config with plain PPO, and curriculum works fine there, so I’m a little lost as to why it doesn’t seem to work with imitation learning.

2021-03-30 12:27:17 INFO [stats.py:139] PongBehavior. Step: 10000. Time Elapsed: 72.417 s. Mean Reward: 137.500. Std of Reward: 131.696. Training.
2021-03-30 12:28:07 INFO [stats.py:139] PongBehavior. Step: 20000. Time Elapsed: 122.900 s. Mean Reward: 12.500. Std of Reward: 92.702. Training.
2021-03-30 12:28:59 INFO [stats.py:139] PongBehavior. Step: 30000. Time Elapsed: 174.495 s. Mean Reward: 227.080. Std of Reward: 195.504. Training.
2021-03-30 12:29:50 INFO [stats.py:139] PongBehavior. Step: 40000. Time Elapsed: 226.001 s. Mean Reward: 455.867. Std of Reward: 320.745. Training.
2021-03-30 12:30:42 INFO [stats.py:139] PongBehavior. Step: 50000. Time Elapsed: 278.101 s. Mean Reward: 779.522. Std of Reward: 340.896. Training.
2021-03-30 12:31:34 INFO [stats.py:139] PongBehavior. Step: 60000. Time Elapsed: 330.013 s. Mean Reward: 762.778. Std of Reward: 388.825. Training.
2021-03-30 12:32:26 INFO [stats.py:139] PongBehavior. Step: 70000. Time Elapsed: 381.935 s. Mean Reward: 945.360. Std of Reward: 452.997. Training.
2021-03-30 12:33:18 INFO [stats.py:139] PongBehavior. Step: 80000. Time Elapsed: 433.979 s. Mean Reward: 693.067. Std of Reward: 440.927. Training.
2021-03-30 12:34:10 INFO [stats.py:139] PongBehavior. Step: 90000. Time Elapsed: 485.811 s. Mean Reward: 1177.640. Std of Reward: 428.542. Training.
2021-03-30 12:35:02 INFO [stats.py:139] PongBehavior. Step: 100000. Time Elapsed: 537.423 s. Mean Reward: 1004.973. Std of Reward: 457.758. Training.
2021-03-30 12:35:54 INFO [stats.py:139] PongBehavior. Step: 110000. Time Elapsed: 589.437 s. Mean Reward: 635.436. Std of Reward: 326.039. Training.
2021-03-30 12:36:46 INFO [stats.py:139] PongBehavior. Step: 120000. Time Elapsed: 641.146 s. Mean Reward: 903.590. Std of Reward: 509.251. Training.
2021-03-30 12:37:38 INFO [stats.py:139] PongBehavior. Step: 130000. Time Elapsed: 693.732 s. Mean Reward: 590.364. Std of Reward: 313.053. Training.
2021-03-30 12:38:30 INFO [stats.py:139] PongBehavior. Step: 140000. Time Elapsed: 746.029 s. Mean Reward: 822.222. Std of Reward: 428.895. Training.
2021-03-30 12:39:22 INFO [stats.py:139] PongBehavior. Step: 150000. Time Elapsed: 797.526 s. Mean Reward: 862.636. Std of Reward: 306.899. Training.
2021-03-30 12:40:14 INFO [stats.py:139] PongBehavior. Step: 160000. Time Elapsed: 849.556 s. Mean Reward: 714.709. Std of Reward: 436.609. Training.
2021-03-30 12:41:08 INFO [stats.py:139] PongBehavior. Step: 170000. Time Elapsed: 903.538 s. Mean Reward: 951.009. Std of Reward: 417.411. Training.
2021-03-30 12:42:02 INFO [stats.py:139] PongBehavior. Step: 180000. Time Elapsed: 957.181 s. Mean Reward: 780.618. Std of Reward: 340.392. Training.
2021-03-30 12:42:55 INFO [stats.py:139] PongBehavior. Step: 190000. Time Elapsed: 1010.496 s. Mean Reward: 711.111. Std of Reward: 414.848. Training.
2021-03-30 12:43:48 INFO [stats.py:139] PongBehavior. Step: 200000. Time Elapsed: 1063.520 s. Mean Reward: 432.400. Std of Reward: 268.492. Training.
2021-03-30 12:44:40 INFO [stats.py:139] PongBehavior. Step: 210000. Time Elapsed: 1115.521 s. Mean Reward: 535.240. Std of Reward: 322.129. Training.
2021-03-30 12:45:31 INFO [stats.py:139] PongBehavior. Step: 220000. Time Elapsed: 1167.089 s. Mean Reward: 561.925. Std of Reward: 486.984. Training.
2021-03-30 12:46:23 INFO [stats.py:139] PongBehavior. Step: 230000. Time Elapsed: 1218.883 s. Mean Reward: 441.390. Std of Reward: 400.383. Training.
2021-03-30 12:47:15 INFO [stats.py:139] PongBehavior. Step: 240000. Time Elapsed: 1270.426 s. Mean Reward: 780.022. Std of Reward: 365.279. Training.

Someone on our research team just tried it out with the PushBlock environment and it worked for them, so maybe there is a formatting issue in your YAML? I’m not quite sure.
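
If it helps, one thing worth double-checking is the nesting: environment_parameters should be a top-level key (a sibling of behaviors, not something nested inside the behavior), with curriculum indented under the parameter name. Here is a minimal sketch of the layout I’d expect, using the names from your config:

behaviors:
  PongBehavior:
    trainer_type: ppo
    # ... hyperparameters, reward_signals (extrinsic + gail), behavioral_cloning, etc.

environment_parameters:
  drone_targets_widths:
    curriculum:
      - name: Lesson0
        completion_criteria:
          measure: progress
          behavior: PongBehavior
          signal_smoothing: true
          min_lesson_length: 10000
          threshold: 0.1
        value:
          sampler_type: uniform
          sampler_parameters:
            min_value: 0.0
            max_value: 0.0
      # ... Lesson1 through Lesson4 follow the same pattern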

Do you see the curriculum printed out when you start training as part of the training config?

Am I supposed to see the curriculum printed out? Even when I run the Wall Jump example environment I don’t see the curriculum being printed, even though curriculum learning works for that env. I’ll try to see if there’s something wrong with the format of my config.

Do you think the person on the research team would mind sharing the config they used for running curriculum learning in an imitation learning setup?

2021-03-30 13:13:18 INFO [learn.py:275] run_seed set to 1263
2021-03-30 13:13:18 INFO [environment.py:205] Listening on port 5004. Start training by pressing the Play button in the Unity Editor.
2021-03-30 13:13:56 INFO [environment.py:112] Connected to Unity environment with package version 1.6.0-preview and communication version 1.2.0
2021-03-30 13:13:56 INFO [environment.py:271] Connected new brain:
SmallWallJump?team=0
2021-03-30 13:13:56 WARNING [stats.py:190] events.out.tfevents.1617076665.youngwook-pc.2198895.0 was left over from a previous run. Deleting.
2021-03-30 13:13:56 INFO [stats.py:147] Hyperparameters for behavior name SmallWallJump: 
trainer_type: ppo
hyperparameters:
  batch_size: 128
  buffer_size: 2048
  learning_rate: 0.0003
  beta: 0.005
  epsilon: 0.2
  lambd: 0.95
  num_epoch: 3
  learning_rate_schedule: linear
network_settings:
  normalize: False
  hidden_units: 256
  num_layers: 2
  vis_encode_type: simple
  memory: None
reward_signals:
  extrinsic:
    gamma: 0.99
    strength: 1.0
init_path: None
keep_checkpoints: 5
checkpoint_interval: 500000
max_steps: 5000000
time_horizon: 128
summary_freq: 20000
threaded: True
self_play: None
behavioral_cloning: None
framework: pytorch
2021-03-30 13:13:58 INFO [environment.py:271] Connected new brain:
BigWallJump?team=0
2021-03-30 13:13:58 WARNING [stats.py:190] events.out.tfevents.1617076667.youngwook-pc.2198895.1 was left over from a previous run. Deleting.
2021-03-30 13:13:58 INFO [stats.py:147] Hyperparameters for behavior name BigWallJump: 
trainer_type: ppo
hyperparameters:
  batch_size: 128
  buffer_size: 2048
  learning_rate: 0.0003
  beta: 0.005
  epsilon: 0.2
  lambd: 0.95
  num_epoch: 3
  learning_rate_schedule: linear
network_settings:
  normalize: False
  hidden_units: 256
  num_layers: 2
  vis_encode_type: simple
  memory: None
reward_signals:
  extrinsic:
    gamma: 0.99
    strength: 1.0
init_path: None
keep_checkpoints: 5
checkpoint_interval: 500000
max_steps: 20000000
time_horizon: 128
summary_freq: 20000
threaded: True
self_play: None
behavioral_cloning: None
framework: pytorch
2021-03-30 13:15:05 INFO [stats.py:139] BigWallJump. Step: 20000. Time Elapsed: 107.393 s. Mean Reward: -0.881. Std of Reward: 0.630. Training.
2021-03-30 13:15:43 INFO [stats.py:139] SmallWallJump. Step: 20000. Time Elapsed: 145.265 s. Mean Reward: -0.824. Std of Reward: 0.625. Training.
2021-03-30 13:16:23 INFO [stats.py:139] BigWallJump. Step: 40000. Time Elapsed: 184.477 s. Mean Reward: 0.009. Std of Reward: 0.934. Training.
2021-03-30 13:17:27 INFO [stats.py:139] SmallWallJump. Step: 40000. Time Elapsed: 249.388 s. Mean Reward: -0.000. Std of Reward: 0.943. Training.
2021-03-30 13:17:32 INFO [stats.py:139] BigWallJump. Step: 60000. Time Elapsed: 254.267 s. Mean Reward: 0.214. Std of Reward: 0.899. Training.
2021-03-30 13:18:18 INFO [environment_parameter_manager.py:155] Parameter 'small_wall_height' has been updated to Float: value=2.0. Now in lesson 'Lesson1'
2021-03-30 13:18:46 INFO [stats.py:139] BigWallJump. Step: 80000. Time Elapsed: 327.867 s. Mean Reward: 0.407. Std of Reward: 0.816. Training.
2021-03-30 13:19:12 INFO [stats.py:139] SmallWallJump. Step: 60000. Time Elapsed: 353.467 s. Mean Reward: 0.128. Std of Reward: 0.927. Training.
2021-03-30 13:20:07 INFO [stats.py:139] BigWallJump. Step: 100000. Time Elapsed: 408.920 s. Mean Reward: 0.528. Std of Reward: 0.743. Training.
2021-03-30 13:20:47 INFO [stats.py:139] SmallWallJump. Step: 80000. Time Elapsed: 448.724 s. Mean Reward: 0.430. Std of Reward: 0.792. Training.
2021-03-30 13:21:23 INFO [stats.py:139] BigWallJump. Step: 120000. Time Elapsed: 484.553 s. Mean Reward: 0.644. Std of Reward: 0.648. Training.
2021-03-30 13:22:23 INFO [stats.py:139] SmallWallJump. Step: 100000. Time Elapsed: 544.860 s. Mean Reward: 0.439. Std of Reward: 0.793. Training.
2021-03-30 13:22:45 INFO [stats.py:139] BigWallJump. Step: 140000. Time Elapsed: 567.327 s. Mean Reward: 0.641. Std of Reward: 0.660. Training.
2021-03-30 13:24:02 INFO [stats.py:139] BigWallJump. Step: 160000. Time Elapsed: 644.375 s. Mean Reward: 0.732. Std of Reward: 0.537. Training.
2021-03-30 13:24:07 INFO [stats.py:139] SmallWallJump. Step: 120000. Time Elapsed: 648.766 s. Mean Reward: 0.557. Std of Reward: 0.710. Training.
2021-03-30 13:25:13 INFO [stats.py:139] BigWallJump. Step: 180000. Time Elapsed: 715.065 s. Mean Reward: 0.740. Std of Reward: 0.529. Training.
2021-03-30 13:26:10 INFO [stats.py:139] SmallWallJump. Step: 140000. Time Elapsed: 772.192 s. Mean Reward: 0.596. Std of Reward: 0.687. Training.
2021-03-30 13:26:25 INFO [stats.py:139] BigWallJump. Step: 200000. Time Elapsed: 786.615 s. Mean Reward: 0.716. Std of Reward: 0.551. Training.
2021-03-30 13:26:25 INFO [environment_parameter_manager.py:155] Parameter 'big_wall_height' has been updated to Uniform sampler: min=4.0, max=7.0. Now in lesson 'Lesson1'

@christophergoy I’m actually able to get curriculum to run now. I guess it was a formatting issue. Thanks for all the help!
