I have been training an agent to stack blocks in order to reach a target positioned on a higher surface. For example, the agent needs to stack three blocks into a staircase and then jump on them to reach the target. However, despite trying various approaches such as imitation learning and curriculum learning, the agent still does not perform the task successfully.
The rewards in this task are sparse: the agent has to stack three blocks correctly and only receives the main reward when it finally collides with the target. To address this, I also give the agent shaping rewards for placing the blocks in their designated locations.
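Roughly, the reward structure looks like the sketch below. This is simplified; the slot array, tolerance, and reward magnitudes are illustrative placeholders rather than my exact values.

// Simplified sketch of the reward structure; designatedSlots, placementTolerance,
// and the 0.1 / 1.0 reward values are illustrative, not my exact code.
using UnityEngine;
using Unity.MLAgents;

public class BlockBuilderAgent : Agent
{
    [SerializeField] private Transform[] designatedSlots;      // where each of the three blocks should go
    [SerializeField] private float placementTolerance = 0.25f;

    // Called from the drop logic when the agent releases a block:
    // small shaping reward if the block ends up close to one of the designated slots.
    private void OnBlockDropped(Transform block)
    {
        foreach (var slot in designatedSlots)
        {
            if (Vector3.Distance(block.position, slot.position) < placementTolerance)
            {
                AddReward(0.1f);
                return;
            }
        }
    }

    // Sparse terminal reward: only given when the agent reaches the target.
    private void OnCollisionEnter(Collision collision)
    {
        if (collision.gameObject.CompareTag("target"))
        {
            AddReward(1.0f);
            EndEpisode();
        }
    }
}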
I have trained the agent multiple times, with some runs reaching 200 million steps. Despite the extensive training, the agent still fails to complete the task reliably.
Observations: (all observations are in the agent's local space)
Agent: Vector3 velocity (the agent’s own velocity)
bool isGrounded (whether the agent is on the ground or in mid-air)
bool isHoldingObject (whether the agent is holding an object or not)
transform.InverseTransformDirection(agentRb.velocity);
Target: Vector3 position
transform.InverseTransformDirection(target.position - transform.position);
Spawner: Vector3 position (The spawner is a GameObject that generates a new block when an existing block is removed or taken)
transform.InverseTransformDirection(spawner.position - transform.position);
I later added observations for the block positions to a BufferSensor (see the sketch after the note below).
Note: The environment I used for training is relatively small and has no obstacles; it consists only of the objects mentioned above, so the agent does not need to search for anything.
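For context, the observations are collected roughly as follows. This is a trimmed-down sketch; the field names (agentRb, target, spawner, bufferSensor, blocks) are illustrative, and the BufferSensor's observable size is assumed to be 3.

// Trimmed-down sketch of how the observations listed above are collected.
using UnityEngine;
using Unity.MLAgents;
using Unity.MLAgents.Sensors;

public class BlockBuilderAgent : Agent
{
    [SerializeField] private Rigidbody agentRb;
    [SerializeField] private Transform target;
    [SerializeField] private Transform spawner;
    [SerializeField] private BufferSensorComponent bufferSensor; // variable-length block observations
    private bool isGrounded;
    private bool isHoldingObject;

    public override void CollectObservations(VectorSensor sensor)
    {
        // Agent state, expressed in local space
        sensor.AddObservation(transform.InverseTransformDirection(agentRb.velocity)); // 3 floats
        sensor.AddObservation(isGrounded);                                            // 1 float
        sensor.AddObservation(isHoldingObject);                                       // 1 float

        // Relative positions of the target and the spawner, in local space
        sensor.AddObservation(transform.InverseTransformDirection(target.position - transform.position));
        sensor.AddObservation(transform.InverseTransformDirection(spawner.position - transform.position));

        // One entry per block, appended to the BufferSensor (observable size = 3)
        foreach (var block in GameObject.FindGameObjectsWithTag("block"))
        {
            Vector3 local = transform.InverseTransformDirection(block.transform.position - transform.position);
            bufferSensor.AppendObservation(new float[] { local.x, local.y, local.z });
        }
    }
}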
Sensors:
Tags: “target”, “block”, “spawner”
Actions:
Discrete Actions:
Move: 0 to stop, 1 to move forward
Rotate: 0 for no rotation, 1 to rotate right, 2 to rotate left
Jump: 0 for not jumping, 1 for jumping
Pickup: 0 to drop if holding an object, 1 to pick up if not holding an object (consumed as sketched below)
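The four discrete branches are handled roughly like this. Again a simplified sketch: the movement constants (moveSpeed, turnSpeed, jumpForce) and the pick-up/drop helpers are placeholders, not my exact implementation.

// Rough sketch of how the discrete action branches are consumed.
using UnityEngine;
using Unity.MLAgents;
using Unity.MLAgents.Actuators;

public class BlockBuilderAgent : Agent
{
    [SerializeField] private Rigidbody agentRb;
    [SerializeField] private float moveSpeed = 2f;
    [SerializeField] private float turnSpeed = 180f;
    [SerializeField] private float jumpForce = 5f;
    private bool isGrounded;
    private bool isHoldingObject;

    public override void OnActionReceived(ActionBuffers actions)
    {
        int move   = actions.DiscreteActions[0]; // 0 = stop, 1 = forward
        int rotate = actions.DiscreteActions[1]; // 0 = none, 1 = right, 2 = left
        int jump   = actions.DiscreteActions[2]; // 0 = no jump, 1 = jump
        int pickup = actions.DiscreteActions[3]; // 0 = drop, 1 = pick up

        if (move == 1)
            agentRb.MovePosition(agentRb.position + transform.forward * moveSpeed * Time.fixedDeltaTime);

        if (rotate == 1)
            transform.Rotate(0f, turnSpeed * Time.fixedDeltaTime, 0f);
        else if (rotate == 2)
            transform.Rotate(0f, -turnSpeed * Time.fixedDeltaTime, 0f);

        if (jump == 1 && isGrounded)
            agentRb.AddForce(Vector3.up * jumpForce, ForceMode.Impulse);

        if (pickup == 1 && !isHoldingObject)
            TryPickUpBlock();   // placeholder helper: attach the nearest block to the agent
        else if (pickup == 0 && isHoldingObject)
            DropBlock();        // placeholder helper: release the held block
    }

    private void TryPickUpBlock() { /* attach the nearest block */ }
    private void DropBlock()      { /* release the held block */ }
}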
Config File:
behaviors:
  BlockBuilder:
    trainer_type: ppo
    hyperparameters:
      batch_size: 128
      buffer_size: 2048
      learning_rate: 0.0003
      beta: 0.01
      epsilon: 0.2
      lambd: 0.95
      num_epoch: 3
      learning_rate_schedule: linear
    network_settings:
      normalize: false
      hidden_units: 512 # tried 256 with bigger layers
      num_layers: 4 # tried 2, 3, and 8
      vis_encode_type: simple
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
      rnd:
        gamma: 0.99
        strength: 0.01
        network_settings:
          hidden_units: 64
          num_layers: 3
        learning_rate: 0.0001
    keep_checkpoints: 5
    max_steps: 200000000
    time_horizon: 128 # I also tried 500