Configuring training parameters for a maze-like environment

I’m trying to create an agent which can traverse a maze-like environment, but the resulting behaviour has not lived up to my expectations.

I have developed an environment-generating script that produces random maze-like levels. These can be varied in difficulty, both in overall size and in the number of obstacles and dead ends.

Here is an example curriculum:

  1. Initial empty environment with two exits and no obstacles:
     [screenshot]

  2. Environment with simple corridors:
     [screenshot]

  3. Environment with a difficulty of 1 (e.g. one of the free tiles is randomly filled):
     [screenshot]

  4. Difficulty of 2:
     [screenshot]

Eventually the levels grow in size and are supposed to culminate in something like this:
[screenshot]
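
To make the difficulty knob concrete, here is a rough sketch of the idea in Python (illustrative only, not my actual generator - the names are made up, and the real script also has to keep the exits reachable after filling tiles):

```
import random

def apply_difficulty(grid, difficulty, rng=random):
    """Fill `difficulty` randomly chosen free tiles of a base maze layout.

    `grid` is a 2D list where 0 = free tile and 1 = wall/obstacle.
    Sketch only: the real generator also checks that the exits remain
    reachable after each tile is filled.
    """
    free_tiles = [(x, y)
                  for y, row in enumerate(grid)
                  for x, cell in enumerate(row)
                  if cell == 0]
    for x, y in rng.sample(free_tiles, k=min(difficulty, len(free_tiles))):
        grid[y][x] = 1
    return grid
```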

The agent’s basic observations are:

  • 8 raycasts in all directions around its body
  • Its own position
  • The location of the nearest exit
  • The distance to the nearest exit
  • The positions of the nearest 10 free tiles
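
As a sanity check on the resulting vector observation size, here is a rough tally of what that list adds up to (assuming a 2D top-down layout, a single value per ray, and the 20-value position stack described below - all of these are assumptions for illustration):

```
# Rough tally of the vector observation size implied by the list above.
# Adjust if the rays also report hit tags or if positions are 3D.
raycasts           = 8          # 8 rays, 1 value each
own_position       = 2          # (x, z)
nearest_exit_pos   = 2          # (x, z)
nearest_exit_dist  = 1
nearest_free_tiles = 10 * 2     # 10 tiles, (x, z) each
position_stack     = 10 * 2     # custom distance-based stack (next paragraph)

total = (raycasts + own_position + nearest_exit_pos
         + nearest_exit_dist + nearest_free_tiles + position_stack)
print(total)  # 53 with these assumptions
```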

In order to give the agent some persistent memory, I created my own version of the stacked vectors provided by the Behaviour Parameters component. It stacks the agent's position, but rather than stacking positions over time it stacks them over distance: a position is only added to the stack once the agent has moved far enough away from the positions already in it, so the agent keeps some memory of where it has been.
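
A rough Python sketch of that stacking logic (illustrative only - the distance threshold and stack size here are placeholders, and my actual implementation lives in the agent code):

```
import numpy as np

class DistanceStack:
    """Stack of past agent positions, updated by distance rather than time.

    A new position is only recorded once the agent is at least
    `min_distance` away from every position already in the stack, and the
    observation is zero-padded until the stack is full.
    """

    def __init__(self, max_positions=10, min_distance=1.0):
        self.max_positions = max_positions
        self.min_distance = min_distance
        self.positions = []

    def update(self, pos):
        pos = np.asarray(pos, dtype=np.float32)
        if all(np.linalg.norm(pos - p) >= self.min_distance for p in self.positions):
            self.positions.append(pos)
            # Assumption for this sketch: drop the oldest entry once full.
            if len(self.positions) > self.max_positions:
                self.positions.pop(0)

    def observation(self):
        """Flattened (x, z) pairs, zero-padded to a fixed length."""
        obs = np.zeros(self.max_positions * 2, dtype=np.float32)
        for i, p in enumerate(self.positions):
            obs[2 * i:2 * i + 2] = p
        return obs
```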

I have had some moderate success using PPO with these parameters:

```
Spy:
  trainer_type: ppo
  hyperparameters:
    batch_size: 128
    buffer_size: 2048
    learning_rate: 0.0003
    beta: 0.01
    epsilon: 0.2
    lambd: 0.95
    num_epoch: 3
    learning_rate_schedule: linear
  network_settings:
    normalize: false
    hidden_units: 512
    num_layers: 2
    vis_encode_type: simple
    memory: null
  reward_signals:
    extrinsic:
      gamma: 0.99
      strength: 1.0
    curiosity:
      gamma: 0.99
      strength: 0.2
      encoding_size: 256
      learning_rate: 0.0003
  keep_checkpoints: 5
  checkpoint_interval: 500000
  max_steps: 10000000
  time_horizon: 128
  summary_freq: 30000
  threaded: true
```

With these settings the agent can become quite proficient at traversing certain types of maze - ones where the path to the exit is fairly obvious and the potential for going down a dead end is small. However, it often comes undone once it takes a wrong turn into a dead end: the longer the route into the dead end, the less often it seems able to backtrack and find the correct path.

I’m also trying out SAC with these parameters, although it is faring worse than the PPO agent:

```
Spy:
  trainer_type: sac
  hyperparameters:
    learning_rate: 0.0001
    learning_rate_schedule: constant
    batch_size: 64
    buffer_size: 64000
    buffer_init_steps: 0
    tau: 0.005
    steps_per_update: 10.0
    save_replay_buffer: false
    init_entcoef: 0.01
    reward_signal_steps_per_update: 10.0
  network_settings:
    normalize: false
    hidden_units: 20
    num_layers: 2
    vis_encode_type: simple
  reward_signals:
    extrinsic:
      gamma: 0.99
      strength: 1.0
  keep_checkpoints: 5
  max_steps: 10000000
  time_horizon: 20
  summary_freq: 30000
  threaded: true
```

Whenever I tweak the configuration files (for either PPO or SAC) I only seem to reduce the effectiveness of training - I'm essentially guessing based on vague intuitions about the parameters.

I have also tried giving the agent a recurrent memory (LSTM) by adding these settings under network_settings:

```
memory:
  memory_size: 128
  sequence_length: 64
```

However, this failed to improve training.

Are there any parameters I could change to improve training for this type of scenario? I'm happy to provide TensorBoard screenshots in the comments if needed (I've already posted the maximum number of images allowed here).

Really cool project! The key here is the observation stacking - how many positions can the agent see at once, and how do you deal with a variable stack size? If the agent doesn’t remember that it entered a dead end, it won’t know to go back and find another path. Either way, this is a really hard problem for model-free RL.

If you’re using memory (LSTM) I’d remove the stacking, and note that you’ll require many more training steps to get a trained model.

There are 20 observations in the stack, 2 for each tile position, so that accounts for 10 previously visited tiles. This should be plenty for any dead end the agent is likely to find. New positions are also only added to the stack if the agent moves a certain distance away from every other position in the stack, so even if the agent moves around a lot within one area, the stack will not replace itself with new positions from that same area. The stack is padded with 0s if the agent hasn’t moved (or can’t move) enough to fill it.

I realise this now. My initial plan for this project was to have the agent evade/sneak past other patrolling agents while traversing this environment. Pathfinding is proving tricky enough.

I’ll try removing the stack and being more patient with the LSTM - it seems as though the agent’s ability to remember where it has been is crucial.