Agent should find a position to move towards, but isn't improving

I’m trying to have my agent determine some coordinates based on an object’s position, but I keep having issues.

In my environment I have a ball that gets thrown at a random angle in the general direction of the agent. Both ball and agent are reset at a random position (within some X and Y constraints) at the beginning of each episode.

I set up some values to determine a rectangle at a fixed position relative to the agent. Since the agent’s Z position is fixed, all that’s required for the rectangle are X and Y coordinates.

The agent’s task is to determine which X and Y coordinates to move towards in order for the ball’s X and Y coordinates to always fall within the rectangle (which moves together with the agent).

I’m giving the agent a small positive reward (+0.1) every time it guesses the right position, and a large negative reward (-10) whenever it guesses wrong. On a wrong guess, the episode also ends.

So far I can’t get the agent trained properly. Regardless of how I change the hyperparameters and NN structure, the mean reward never goes above approximately -9.945.

Here’s my code for the rewards. Am I doing something blatantly wrong in here?

public GameObject ball;
public GameObject agent;
private Vector2 bounds;
private Vector2 estimatedTarget;

public override void CollectObservations(VectorSensor sensor) {
    sensor.AddObservation(new Vector2(ball.transform.localPosition.x, ball.transform.localPosition.y));
}

public override void OnActionReceived(ActionBuffers actions) {
    float estimateActionX = actions.ContinuousActions[0];
    float estimateActionY = actions.ContinuousActions[1];

    //Local coordinates of rectangle
    estimatedTarget = new Vector2(estimateActionX, estimateActionY);
    Vector2 lowBounds = new Vector2(estimatedTarget.x + bounds.x, estimatedTarget.y - bounds.y);
    Vector2 highBounds = new Vector2(estimatedTarget.x + 2f * bounds.x, estimatedTarget.y + bounds.y);
      
    //Reward based on correct estimation of target
    if (IsInRange(ball.transform.localPosition, lowBounds, highBounds)) { AddReward(0.1f); }
    else {
        AddReward(-10f);
        EndEpisode();
    }

    //Episode ends when the ball gets behind the agent
    if (ball.transform.localPosition.z <= agent.transform.localPosition.z) { EndEpisode(); }
}

public bool IsInRange(Vector3 ball, Vector2 min, Vector2 max) {
    return ball.x >= min.x && ball.x <= max.x && ball.y >= min.y && ball.y <= max.y;
}

My guess is that your reward values (specifically the negative reward) are the problem. Generally, you want to keep your rewards bounded within [-1, 1]. If your agent is doing nothing, it’s probably because every action it tries appears to lead to a negative reward, so it sees no incentive to do anything at all.
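
As a rough sanity check on the numbers in the question: assuming the reported value is the mean cumulative reward per episode, and assuming every episode ends on a wrong guess, an episode with k correct guesses returns R = 0.1k - 10. The reported ceiling then implies that the agent makes fewer than one correct guess per episode on average:

```latex
R = 0.1\,k - 10,
\qquad
\bar{R} \approx -9.945
\;\Longrightarrow\;
\bar{k} \approx \frac{-9.945 + 10}{0.1} = 0.55
```

With the -10 penalty a hundred times larger than the +0.1 reward, the penalty dominates the return almost entirely.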

What I would do is give a small negative reward of -1/max_steps on every fixed update. This helps the agent understand that doing nothing results in a negative reward. You can then give a small positive reward if the agent is moving towards the goal (the ball) and a large positive reward (1f) if the agent reaches the ball. Otherwise, if the episode ends and the agent didn’t reach the goal state, give a large negative reward (-1f).
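
The scheme described above could be sketched roughly like this. It is only an illustration: BoundedRewardSketch, its method names, and the default progressBonus of 0.01 are hypothetical and not part of ML-Agents, and the helper is kept free of Unity types so the arithmetic is easy to check on its own.

```csharp
using System;

// Hypothetical helper (not part of ML-Agents): keeps rewards within [-1, 1].
public static class BoundedRewardSketch
{
    // Small per-step penalty; summed over maxSteps steps it totals -1.
    public static float StepPenalty(int maxSteps)
    {
        if (maxSteps <= 0)
            throw new ArgumentOutOfRangeException(nameof(maxSteps), "maxSteps must be positive.");
        return -1f / maxSteps;
    }

    // Reward for a non-terminal step: the time penalty, plus a small bonus
    // when the agent made progress. The bonus size is an assumed value.
    public static float StepReward(int maxSteps, bool madeProgress, float progressBonus = 0.01f)
    {
        float reward = StepPenalty(maxSteps);
        if (madeProgress)
        {
            reward += progressBonus;
        }
        return reward;
    }

    // Reward at the end of the episode: +1 on success, -1 on failure.
    public static float TerminalReward(bool reachedGoal)
    {
        return reachedGoal ? 1f : -1f;
    }
}
```

Inside an Agent subclass, this would be used as AddReward(BoundedRewardSketch.StepReward(MaxStep, madeProgress)) in OnActionReceived, and AddReward(BoundedRewardSketch.TerminalReward(reachedGoal)) right before EndEpisode(). Note that MaxStep has to be set to a non-zero value for this to work, since 0 means the episode has no step limit.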

I see, that makes sense. I was working on the assumption that a large negative reward for performing badly would automatically be interpreted as an incentive to perform correctly (especially considering that the agent did perform correctly at the start).

About the per-step penalty: what if the episodes have variable length? The initial velocity of the ball changes with every throw, meaning each episode will end at a slightly different time. Should I just calculate how long each episode will last the moment it begins, and use that to calculate how many steps there will be in total?
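
To make the idea concrete, estimating the step count at the start of an episode might look something like the sketch below. It rests on assumptions that may not hold in the actual environment: the ball's speed along Z stays constant during the throw, and the agent receives one action per fixed update. EpisodeLengthSketch and its parameter names are hypothetical.

```csharp
using System;

// Hypothetical helper: estimates how many steps an episode will last,
// assuming constant ball speed along Z and one agent step per fixed update.
public static class EpisodeLengthSketch
{
    public static int EstimateSteps(float distanceZ, float speedZ, float fixedDeltaTime)
    {
        if (speedZ <= 0f)
            throw new ArgumentOutOfRangeException(nameof(speedZ), "speedZ must be positive.");
        if (fixedDeltaTime <= 0f)
            throw new ArgumentOutOfRangeException(nameof(fixedDeltaTime), "fixedDeltaTime must be positive.");

        // Time for the ball to cover the Z distance, expressed in fixed updates.
        double steps = (double)distanceZ / ((double)speedZ * fixedDeltaTime);

        // Round up and never return fewer than one step.
        return Math.Max(1, (int)Math.Ceiling(steps));
    }
}
```

The per-step penalty for that episode would then be -1f / EstimateSteps(...), computed once in OnEpisodeBegin from the ball's initial Z distance to the agent, its initial Z speed, and Time.fixedDeltaTime.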

What I want here, though, is for the agent to always find the right coordinates. There shouldn’t be any “moving towards the goal,” right? What I need is for the agent to output some acceptable coordinates at each step (the agent will then move following those coordinates, but the movement isn’t really important). Aren’t the final large rewards unnecessary in this situation?