I’m trying to create an agent which can traverse a maze-like environment, but the resulting behaviour has not lived up to my expectations.
I have developed an environment-generating script which produces random maze-like levels. These can be varied in difficulty, both in terms of the overall size and the number of obstacles and dead ends.
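The obstacle side of the difficulty knob essentially amounts to filling randomly chosen free tiles. A simplified sketch of that idea in Python (illustrative only; the real generator also handles level size, exits and corridor layout):

```
import random

def apply_difficulty(grid, difficulty, rng=random):
    """Fill `difficulty` randomly chosen free tiles with obstacles.

    `grid` is a 2D list where 0 = free tile and 1 = wall/obstacle.
    Simplified sketch only - the actual generator also varies the
    level size and places exits and corridors.
    """
    free = [(r, c) for r, row in enumerate(grid)
            for c, v in enumerate(row) if v == 0]
    for r, c in rng.sample(free, min(difficulty, len(free))):
        grid[r][c] = 1
    return grid
```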
Here is an example curriculum:
- Initial empty environment with two exits and no obstacles
- Environment with simple corridors
- Environment with a difficulty of 1 (e.g. one of the free tiles is randomly filled)
- Difficulty of 2
Eventually the levels grow in size and are supposed to culminate in a full-size maze with many obstacles and dead ends.
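A curriculum like this can be driven from the trainer config via the standard environment_parameters curriculum. A rough sketch (the maze_difficulty parameter name and the thresholds are illustrative placeholders, not my actual setup):

```
environment_parameters:
  maze_difficulty:
    curriculum:
      - name: EmptyRoom
        completion_criteria:
          measure: reward
          behavior: Spy
          threshold: 0.8
          min_lesson_length: 100
        value: 0.0
      - name: Corridors
        completion_criteria:
          measure: reward
          behavior: Spy
          threshold: 0.8
          min_lesson_length: 100
        value: 1.0
      - name: FullMaze
        value: 2.0
```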
The agent’s basic observations are:
- 8 raycasts in every direction around its body
- Its own position
- The location of the nearest exit
- The distance to the nearest exit
- The positions of the nearest 10 free tiles
In order to give the agent some persistent memory, I created my own version of the stacked vectors provided in the Behaviour Parameters component. Rather than stacking the agent's position over time, it stacks positions over distance: a position is only added to the stack once the agent has moved to a new position, so it has some memory of where it has been.
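In case it clarifies the mechanism, the stacking logic amounts to something like the following. This is a simplified Python sketch of the idea, not the actual implementation:

```
from collections import deque

class DistanceStackedPositions:
    """Fixed-size stack of previously visited positions, updated by
    distance travelled rather than by time step (simplified sketch)."""

    def __init__(self, stack_size=10, min_move=1.0):
        self.min_move = min_move
        # Pre-fill so the flattened observation always has the same length.
        self.stack = deque([(0.0, 0.0)] * stack_size, maxlen=stack_size)

    def update(self, position):
        # Only record the position once the agent has moved far enough
        # from the last recorded one.
        last = self.stack[-1]
        dx, dy = position[0] - last[0], position[1] - last[1]
        if (dx * dx + dy * dy) ** 0.5 >= self.min_move:
            self.stack.append(position)  # oldest entry drops off automatically

    def observation(self):
        # Flatten into one vector to append to the agent's observations.
        return [c for pos in self.stack for c in pos]
```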
I have had some moderate success using PPO with these parameters:
```
Spy:
  trainer_type: ppo
  hyperparameters:
    batch_size: 128
    buffer_size: 2048
    learning_rate: 0.0003
    beta: 0.01
    epsilon: 0.2
    lambd: 0.95
    num_epoch: 3
    learning_rate_schedule: linear
  network_settings:
    normalize: false
    hidden_units: 512
    num_layers: 2
    vis_encode_type: simple
    memory: null
  reward_signals:
    extrinsic:
      gamma: 0.99
      strength: 1.0
    curiosity:
      gamma: 0.99
      strength: 0.2
      encoding_size: 256
      learning_rate: 0.0003
  keep_checkpoints: 5
  checkpoint_interval: 500000
  max_steps: 10000000
  time_horizon: 128
  summary_freq: 30000
  threaded: true
```
With these settings the agent can become quite proficient at traversing certain types of maze - ones where the path to the exit is fairly obvious and the potential for going down a dead end is small. However, it often comes undone if it takes a wrong turn into a dead end, and the longer the route into the dead end, the less often it manages to backtrack and find the right path.
I’m also trying out SAC with these parameters, although it is faring worse than the PPO agent:
```
Spy:
  trainer_type: sac
  hyperparameters:
    learning_rate: 0.0001
    learning_rate_schedule: constant
    batch_size: 64
    buffer_size: 64000
    buffer_init_steps: 0
    tau: 0.005
    steps_per_update: 10.0
    save_replay_buffer: false
    init_entcoef: 0.01
    reward_signal_steps_per_update: 10.0
  network_settings:
    normalize: false
    hidden_units: 20
    num_layers: 2
    vis_encode_type: simple
  reward_signals:
    extrinsic:
      gamma: 0.99
      strength: 1.0
  keep_checkpoints: 5
  max_steps: 10000000
  time_horizon: 20
  summary_freq: 30000
  threaded: true
```
When I try to tweak the configuration files (with either PPO or SAC) I seem to reduce the effectiveness of training. I am essentially guessing based on some vague intuitions when it comes to tweaking the parameters.
I have tried introducing recurrent memory (an LSTM) by adding these parameters:
```
memory:
  memory_size: 128
  sequence_length: 64
```
However, this failed to improve training.
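For reference, these settings go under network_settings, i.e. in the PPO config above they replace the memory: null entry, roughly like this:

```
network_settings:
  normalize: false
  hidden_units: 512
  num_layers: 2
  vis_encode_type: simple
  memory:
    memory_size: 128
    sequence_length: 64
```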
Are there any parameters I could change to improve training for this type of scenario? I'm happy to provide TensorBoard screenshots in the comments if needed (I've already posted the maximum number of images allowed here).