Agent Struggling to Learn Multi-Step Task

I have been training an agent to stack blocks in order to reach a target positioned on a higher surface. For example, the agent needs to stack three blocks into a staircase shape and then jump on them to reach the target. However, despite trying various methods such as imitation learning and curriculum learning, the agent still does not learn to perform the task successfully.

The rewards in this task are sparse: the agent must learn to stack three blocks correctly and only receives a reward when it finally collides with the target. To address this, I have added shaping rewards for correctly placing each block in its designated location.
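
To make the shaping concrete, here is a minimal sketch of how such a reward could be wired up in an ML-Agents Agent; the class name, the OnBlockPlacedCorrectly hook, and the reward values are illustrative placeholders, not the actual project code:

using UnityEngine;
using Unity.MLAgents;

public class BlockBuilderAgent : Agent // illustrative class name
{
    // Called by the environment when a block lands in its designated spot (placeholder hook).
    public void OnBlockPlacedCorrectly()
    {
        AddReward(0.1f); // small shaping reward per correctly placed block (value illustrative)
    }

    void OnCollisionEnter(Collision collision)
    {
        if (collision.gameObject.CompareTag("target"))
        {
            AddReward(1.0f); // sparse terminal reward for reaching the target
            EndEpisode();
        }
    }
}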

I have trained the agents multiple times, with some runs reaching up to 200 million steps. Despite the extensive training, the agents still fail to complete the task.

Observations: (All observations are in the agent's local space)
Agent: Vector3 velocity (the agent’s own velocity)
bool isGrounded (whether the agent is on the ground or in mid-air)
bool isHoldingObject (whether the agent is holding an object or not)

transform.InverseTransformDirection(agentRb.velocity);

Target: Vector3 position

transform.InverseTransformDirection(target.position - transform.position);

Spawner: Vector3 position (The spawner is a GameObject that generates a new block when an existing block is removed or taken)

transform.InverseTransformDirection(spawner.position - transform.position);

I later added observations for the block positions to a BufferSensor
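
For context, a minimal sketch of how these observations could be collected; the field names, the block list, and the BufferSensorComponent reference are assumptions for illustration:

using System.Collections.Generic;
using UnityEngine;
using Unity.MLAgents;
using Unity.MLAgents.Sensors;

public class BlockBuilderAgent : Agent // illustrative
{
    public Transform target;
    public Transform spawner;
    public List<Transform> blocks;            // current blocks in the scene (placeholder)
    public BufferSensorComponent blockBuffer; // assumed to be assigned in the Inspector
    Rigidbody agentRb;
    bool isGrounded;
    bool isHoldingObject;

    public override void Initialize()
    {
        agentRb = GetComponent<Rigidbody>();
    }

    public override void CollectObservations(VectorSensor sensor)
    {
        // All vector observations are expressed in the agent's local space.
        sensor.AddObservation(transform.InverseTransformDirection(agentRb.velocity));
        sensor.AddObservation(isGrounded);
        sensor.AddObservation(isHoldingObject);
        sensor.AddObservation(transform.InverseTransformDirection(target.position - transform.position));
        sensor.AddObservation(transform.InverseTransformDirection(spawner.position - transform.position));

        // A variable number of block positions goes into the BufferSensor, one entry per block.
        foreach (var block in blocks)
        {
            Vector3 localPos = transform.InverseTransformDirection(block.position - transform.position);
            blockBuffer.AppendObservation(new float[] { localPos.x, localPos.y, localPos.z });
        }
    }
}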

Note: The environment I used for training is relatively small, has no obstacles, and only contains the objects mentioned earlier. So the agent does not need to search for any objects within the environment.

Sensors:
Tags: “target”, “block”, “spawner”

Actions:
Discrete Actions:
Move: 0 to stop, 1 to move forward
Rotate: 0 for no rotation, 1 to rotate right, 2 to rotate left
Jump: 0 for not jumping, 1 for jumping
Pickup: 0 to drop if holding an object, 1 to pickup if not holding an object
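
A rough sketch of how these four branches might map onto OnActionReceived, assuming the branch order matches the list above; the speeds, forces, and the pickup/drop helpers are illustrative placeholders:

using UnityEngine;
using Unity.MLAgents;
using Unity.MLAgents.Actuators;

public class BlockBuilderAgent : Agent // illustrative
{
    public float moveSpeed = 2f;   // placeholder tuning values
    public float turnSpeed = 180f;
    public float jumpForce = 5f;
    Rigidbody agentRb;
    bool isGrounded;
    bool isHoldingObject;

    public override void OnActionReceived(ActionBuffers actions)
    {
        int move   = actions.DiscreteActions[0]; // 0 = stop, 1 = move forward
        int rotate = actions.DiscreteActions[1]; // 0 = none, 1 = right, 2 = left
        int jump   = actions.DiscreteActions[2]; // 0 = no jump, 1 = jump
        int pickup = actions.DiscreteActions[3]; // 0 = drop, 1 = pick up

        if (move == 1)
            agentRb.MovePosition(agentRb.position + transform.forward * moveSpeed * Time.fixedDeltaTime);

        if (rotate == 1)
            transform.Rotate(0f, turnSpeed * Time.fixedDeltaTime, 0f);
        else if (rotate == 2)
            transform.Rotate(0f, -turnSpeed * Time.fixedDeltaTime, 0f);

        if (jump == 1 && isGrounded)
            agentRb.AddForce(Vector3.up * jumpForce, ForceMode.Impulse);

        if (pickup == 1 && !isHoldingObject)
            TryPickupBlock(); // placeholder helper
        else if (pickup == 0 && isHoldingObject)
            DropBlock();      // placeholder helper
    }

    void TryPickupBlock() { /* attach the nearest block (placeholder) */ }
    void DropBlock() { /* release the held block (placeholder) */ }
}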

Config File:

behaviors:
  BlockBuilder:
    trainer_type: ppo
    hyperparameters:
      batch_size: 128
      buffer_size: 2048
      learning_rate: 0.0003
      beta: 0.01
      epsilon: 0.2
      lambd: 0.95
      num_epoch: 3
      learning_rate_schedule: linear
    network_settings:
      normalize: false
      hidden_units: 512 # tried 256 with bigger layers
      num_layers: 4 # tried 2, 3, and 8
      vis_encode_type: simple
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
      rnd:
        gamma: 0.99
        strength: 0.01
        network_settings:
          hidden_units: 64
          num_layers: 3
        learning_rate: 0.0001
    keep_checkpoints: 5
    max_steps: 200000000
    time_horizon: 128 # I also tried 500

As a first thing, try changing normalization to true. Also, your problem seems very hard; I'm not sure it can work like this.
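
i.e. in the config above, something like:

network_settings:
  normalize: true # was false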

Hi, I recently learned about curriculum learning. If there are two behaviors, and one is trained with CL while the other is not, how can I achieve that? I found that the CL setting in trainer.yaml is global.
Thanks

hello,

I don't know, sorry. The question is whether it would be problematic for your case to just read out the values in Unity when one of them reaches a certain threshold.

edit: posted in your thread instead

Hmmm… What I mean is that there are two behaviors (brains) to be trained. I want to train one brain with CL and the other without. Usually we configure this in a YAML file, is that right? With my current knowledge I don't know of any other configuration methods. I tried a lot, as follows:
CL is subordinate to environment parameters. In the training file, environment_parameters sits at the same level as behaviors. I initially thought that the behavior name in the YAML determines which behavior uses CL, but it turned out that was not the case: it affects both behaviors. After that, I tried to nest the environment parameter under the behavior I wanted to train with curriculum learning, but ML-Agents reported an error, and the documentation specifically emphasizes that curriculum learning only attaches to environment parameters, i.e. it cannot be further scoped to a single behavior.
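
For reference, this is roughly the layout I was experimenting with; the parameter name, behavior name, and lesson values here are placeholders. If I understand the docs correctly, the curriculum can only hang off an environment parameter, and the behavior field inside completion_criteria only decides when a lesson advances, not which behavior the parameter applies to:

environment_parameters:
  num_blocks:                   # placeholder parameter name
    curriculum:
      - name: Lesson0
        completion_criteria:
          measure: progress
          behavior: MyBehaviorA # placeholder; only controls when to advance the lesson
          signal_smoothing: true
          min_lesson_length: 100
          threshold: 0.3
        value: 1.0
      - name: Lesson1
        value: 3.0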

In case you hadn't seen it in the showcase sticky, someone did something a bit similar.