ML-agent not improving at all

Hello!

I am a university student currently working on my thesis, which involves creating a volleyball-esque game and adding an opponent AI using Unity’s ML-Agents. At the moment I have all of the basic functionality in the game - the player can move, jump, dash in any of the cardinal directions, interact with the ball, and score points. I’ve set up an environment for training a model to play the game, but I have had no luck getting even a halfway decently working AI opponent - the mean reward never meaningfully increases! So here I am, asking for assistance.

I shall add my current code down below. Right now my reward function gives a small amount of reward based on how long the ball stays in play, plus an increasing reward the closer the agent is to the ball, to incentivize interacting with it. The reward structure will probably change once I get one decent result, but right now the agent isn’t even able to keep the ball in play for any reasonable amount of time. Considering the goal is to create an AI opponent capable of actually playing the game and scoring points, this isn’t a very good result.

PlayerAgent.cs

using UnityEngine;
using Unity.MLAgents;
using Unity.MLAgents.Sensors;
using UnityEngine.InputSystem;

public class PlayerAgent : Agent
{
    public Ball ball;
    public Player otherPlayer;
    public Player agentPlayer;
    public bool XFlipped;
   
    private float xFlipMul;
    private float accountedPoints;

    public override void CollectObservations(VectorSensor sensor)
    {
        // Agent
        var pos = agentPlayer.transform.localPosition;
        sensor.AddObservation(pos.x * xFlipMul);
        sensor.AddObservation(pos.y);
       
        var vel = agentPlayer.velocity.current;
        sensor.AddObservation(vel.x * xFlipMul);
        sensor.AddObservation(vel.y);

        // Opponent
        pos = otherPlayer.transform.localPosition;
        sensor.AddObservation(pos.x * xFlipMul);
        sensor.AddObservation(pos.y);

        vel = otherPlayer.velocity.current;
        sensor.AddObservation(vel.x * xFlipMul);
        sensor.AddObservation(vel.y);

        // Ball
        pos = ball.transform.localPosition;
        sensor.AddObservation(pos.x * xFlipMul);
        sensor.AddObservation(pos.y);
       
        vel = ball.velocity.current;
        sensor.AddObservation(vel.x * xFlipMul);
        sensor.AddObservation(vel.y);
    }

    public override void OnActionReceived(float[] vectorAction)
    {
        // Convert the continuous action values to key presses by thresholding at 0.5.
        Vector2 movement = new Vector2();
        movement.x = Mathf.Abs(vectorAction[0]) > .5f ? vectorAction[0] * xFlipMul : 0;
        movement.y = Mathf.Abs(vectorAction[1]) > .5f ? vectorAction[1] : 0;
        agentPlayer.inputManager.SetMovementKey(movement);
        agentPlayer.inputManager.SetJumpKey(vectorAction[2]);
        agentPlayer.inputManager.SetDashKey(vectorAction[3]);
        float checkPoints = ball.Game.leftPoint;
        float otherPoints = ball.Game.rightPoint;

        if (agentPlayer.Game.Player2.Equals(agentPlayer))
        {
            checkPoints = ball.Game.rightPoint;
            otherPoints = ball.Game.leftPoint;
        }
       
        // Reached target
        if (checkPoints > accountedPoints)
        {
            EndEpisode();
            accountedPoints = checkPoints;
        }

        // Small reward for every decision step the ball stays in play.
        AddReward(0.01f);

        // Extra reward that grows the closer the agent is to the ball.
        float dist = (agentPlayer.transform.position - ball.transform.position).magnitude;
        float threshold = 5f;
        if (dist < threshold)
        {
            AddReward(0.2f * Time.deltaTime * (1 - dist / threshold));
        }
       
    }

    public override void Heuristic(float[] actionsOut)
    {
        actionsOut[0] = (Keyboard.current.rightArrowKey.isPressed ? 1 : 0) - (Keyboard.current.leftArrowKey.isPressed ? 1 : 0);
        actionsOut[1] = (Keyboard.current.upArrowKey.isPressed ? 1 : 0) - (Keyboard.current.downArrowKey.isPressed ? 1 : 0);
        actionsOut[2] = Keyboard.current.zKey.isPressed ? 1 : 0;
        actionsOut[3] = Keyboard.current.xKey.isPressed ? 1 : 0;
    }

    public override void OnEpisodeBegin()
    {
        xFlipMul = XFlipped ? -1f : 1f;

        ball.Game.Reset(xFlipMul);
        ball.Game.leftPoint = 0;
        ball.Game.rightPoint = 0;
        accountedPoints = 0;
    }
}

configuration.yaml

default_settings: null
behaviors:
  PlayerBehaviour:
    trainer_type: ppo
    hyperparameters:
      batch_size: 32
      buffer_size: 512
      learning_rate: 0.0003
      beta: 0.005
      epsilon: 0.2
      lambd: 0.99
      num_epoch: 500
      learning_rate_schedule: constant
    network_settings:
      normalize: false
      hidden_units: 128
      num_layers: 2
      vis_encode_type: simple
      memory: null
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
      curiosity:
        gamma: 0.99
        strength: 0.02
        encoding_size: 256
        learning_rate: 0.0003
    init_path: null
    keep_checkpoints: 5
    checkpoint_interval: 500000
    max_steps: 10000000
    time_horizon: 32
    summary_freq: 10000
    threaded: true
    self_play:
      save_steps: 10000
      team_change: 20000
      swap_steps: 2000
      window: 20
      play_against_latest_model_ratio: 0.5
      initial_elo: 1200.0
    behavioral_cloning: null
    framework: tensorflow
env_settings:
  env_path: null
  env_args: null
  base_port: 5005
  num_envs: 1
  seed: -1
engine_settings:
  width: 84
  height: 84
  quality_level: 5
  time_scale: 20
  target_frame_rate: -1
  capture_frame_rate: 60
  no_graphics: false
environment_parameters: null
checkpoint_settings:
  run_id: 22Nov
  initialize_from: null
  load_model: false
  resume: true
  force: false
  train_model: false
  inference: false
debug: false

TensorBoard:

[screenshot: the mean reward curve stays essentially flat over training]

Here is a link to the repo of the project, if you wish to try and test things out on your own or see how the game works: GitHub - TanelMarran/Voll-AI: 2 player volleyball versus game with machine learning AI

I am using Unity Version 2019.4.15f1.

Here are all of the dependencies in the python venv I use to train my models:

Package                Version
---------------------- ---------
absl-py                0.11.0
astunparse             1.6.3
attrs                  20.3.0
cachetools             4.1.1
cattrs                 1.0.0
certifi                2020.11.8
chardet                3.0.4
cloudpickle            1.6.0
future                 0.18.2
gast                   0.3.3
google-auth            1.23.0
google-auth-oauthlib   0.4.2
google-pasta           0.2.0
grpcio                 1.33.2
gym                    0.17.3
gym-unity              0.21.1
h5py                   2.10.0
idna                   2.10
Keras-Preprocessing    1.1.2
Markdown               3.3.3
mlagents               0.21.1
mlagents-envs          0.21.1
numpy                  1.18.0
oauthlib               3.1.0
opt-einsum             3.3.0
Pillow                 8.0.1
pip                    20.2.4
protobuf               3.14.0
pyasn1                 0.4.8
pyasn1-modules         0.2.8
pyglet                 1.5.0
pypiwin32              223
pywin32                300
PyYAML                 5.3.1
requests               2.25.0
requests-oauthlib      1.3.0
rsa                    4.6
scipy                  1.5.4
setuptools             49.2.1
six                    1.15.0
tensorboard            2.4.0
tensorboard-plugin-wit 1.7.0
tensorflow             2.3.1
tensorflow-estimator   2.3.0
termcolor              1.1.0
urllib3                1.26.2
Werkzeug               1.0.1
wheel                  0.35.1
wrapt                  1.12.1

If there is any other info you would like me to share, please let me know. Thank you in advance!

I wonder if you should give a negative reward when it drops the ball.
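
Something along those lines in whatever script detects the ball hitting the ground, maybe (landedOnAgentSide and the agent reference are assumptions about your game code):

// Hypothetical ground-collision handler on the ball.
private void OnBallHitGround(bool landedOnAgentSide, PlayerAgent agent)
{
    if (landedOnAgentSide)
    {
        agent.AddReward(-1f); // punish dropping the ball on your own side
    }
}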

Hi, a couple of things…

You’re mixing continuous and discrete actions. Your Behavior Parameters are set to Space Type = Continuous, but the game logic expects discrete actions, and you’re converting the former to the latter by thresholding the action values at 0.5. Change the space type to Discrete instead, and create action branches for movement, jump and dash.
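
For reference, a rough sketch of what that could look like with the same legacy float[] API you’re already using, assuming four branches sized 3, 3, 2 and 2 (move X, move Y, jump, dash) and assuming your SetJumpKey/SetDashKey treat any non-zero value as pressed:

public override void OnActionReceived(float[] vectorAction)
{
    // Discrete branch values arrive as floats holding the chosen index per branch.
    int moveX = Mathf.FloorToInt(vectorAction[0]); // 0 = left, 1 = idle, 2 = right
    int moveY = Mathf.FloorToInt(vectorAction[1]); // 0 = down, 1 = idle, 2 = up
    int jump = Mathf.FloorToInt(vectorAction[2]);  // 0 = released, 1 = pressed
    int dash = Mathf.FloorToInt(vectorAction[3]);  // 0 = released, 1 = pressed

    Vector2 movement = new Vector2((moveX - 1) * xFlipMul, moveY - 1);
    agentPlayer.inputManager.SetMovementKey(movement);
    agentPlayer.inputManager.SetJumpKey(jump);
    agentPlayer.inputManager.SetDashKey(dash);

    // ... reward logic as before ...
}

public override void Heuristic(float[] actionsOut)
{
    // Write branch indices instead of signed axis values.
    actionsOut[0] = 1 + (Keyboard.current.rightArrowKey.isPressed ? 1 : 0) - (Keyboard.current.leftArrowKey.isPressed ? 1 : 0);
    actionsOut[1] = 1 + (Keyboard.current.upArrowKey.isPressed ? 1 : 0) - (Keyboard.current.downArrowKey.isPressed ? 1 : 0);
    actionsOut[2] = Keyboard.current.zKey.isPressed ? 1 : 0;
    actionsOut[3] = Keyboard.current.xKey.isPressed ? 1 : 0;
}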

Your hyperparameter settings for batch_size and buffer_size should be fine for discrete actions, but are likely too low when using continuous actions.

You’re not normalizing agent observations, so the position and velocity values vary too much for the learning algorithm to make sense of them. They should be constrained to a range between -1 and +1. You can set the hyperparameter normalize = true in network_settings, which tells the trainer to adapt to the observation values it receives over time. Or you can normalize the values yourself in your agent class before adding them to the vector sensor (my personal preference). The simplest option is to divide them by the maximum possible values, giving a linear mapping of e.g. zero distance → max distance onto 0 → 1.

In some cases though, a linear mapping isn’t ideal for what the agent needs to be aware of. For instance, a ball distance change from 1m to 2m matters more than a change from 10m to 11m, so a non-linear mapping makes more sense there. I often rely on a sigmoid-shaped (softsign) function for this; it has high resolution for small values and flattens out for large ones: float normalized = value / (1f + Mathf.Abs(value));
Make sure to localize observations so they are relative to the agent’s frame of reference. I think you’re doing this already with the xFlipMul field. Observations must not differ depending on which side the agent is playing on.
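
A rough sketch of the manual approach (courtHalfWidth and courtHeight are made-up names for whatever your actual play area bounds are):

// Softsign squashing: high resolution near zero, flattens out for large values, output in (-1, 1).
private float Squash(float value)
{
    return value / (1f + Mathf.Abs(value));
}

public override void CollectObservations(VectorSensor sensor)
{
    // Linear normalization when you know the bounds...
    var pos = agentPlayer.transform.localPosition;
    sensor.AddObservation(pos.x * xFlipMul / courtHalfWidth);
    sensor.AddObservation(pos.y / courtHeight);

    // ...softsign squashing when you don't (e.g. velocities).
    var vel = agentPlayer.velocity.current;
    sensor.AddObservation(Squash(vel.x * xFlipMul));
    sensor.AddObservation(Squash(vel.y));

    // ... repeat for the opponent and the ball ...
}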


I’m pretty sure I tried this once before, but it didn’t seem to help much. I’ll give it another go when I get the time, though!

Thank you so much for the feedback! I don’t have time to try these out right now, but over the weekend I’ll work through any suggestions I get on this post and report the results.

I don’t think self-play and your current reward structure go well together. Self-play should result in increasing ELO (which it looks like it’s doing) but not necessarily increasing reward; it just uses reward to determine the winner, and expects the rewards to be zero-sum between the teams.

I would recommend that you do one of:

  • disable self-play.
  • change your rewards so that agents get +1 reward for winning a point (is that the right volleyball term?) and -1 reward for losing.

I think that the second one is what you actually want, since it should eventually train agents to win.
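
A rough sketch of that second option, called from whatever script in your environment detects a point being scored (the method name and agent references are placeholders for your own game code):

// Called by the game/environment when one side scores a point.
public void OnPointScored(PlayerAgent scorer, PlayerAgent conceder)
{
    scorer.SetReward(1f);    // zero-sum rewards, which is what self-play expects
    conceder.SetReward(-1f);

    scorer.EndEpisode();     // end the episode for both agents, not just the scorer
    conceder.EndEpisode();
}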

A bunch of things seem off with your reward and episode logic:

  • The existential reward should be negative, pushing the agent to try to end the episode as fast as possible. Right now you are incentivizing the agent to do nothing (i.e. survive), and it is hard for it to pick up on which actions are good or bad because they all get the same reward until the episode ends. Use a negative existential reward plus positive rewards for specific game-related things like distance to the ball, hitting the ball, and scoring, and an explicit big negative reward (-1) for getting scored on (see the sketch after this list).
  • The ball-distance reward currently uses the full 2D distance (both x and y). A better approach would probably be to only consider the x delta and omit the height from the distance metric. As long as the player learns to get under the ball, that’s good enough; with the full distance you are effectively penalizing hitting the ball high.
  • Check that your threshold (currently 5) is a sensible cutoff for where the ball-distance reward applies; it depends on the size and scale of your game objects.
  • You end the episode only when the current player scores. What if the other agent scores? You should call EndEpisode on BOTH agents whenever either of them scores. Currently each agent learns that by losing points it can extend its episode and accumulate more reward, because the existential reward is positive, there is no punishment for getting scored on, and the episode doesn’t end when it loses.
  • Ending the episode for both agents also applies when the step limit is reached. Set MaxStep to 0 on your agents and handle it through your environment, so that both agents start and end their episodes at the same time.
  • You can simply end episodes on each point of the game. The optimal behavior for scoring a single point also plays well in a game to 10 or 15 points. Don’t complicate the game for your teeny tiny ML-Agents learning brain.
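
A rough sketch of a reward shape along those lines (the constants are illustrative, not tuned for your game):

public override void OnActionReceived(float[] vectorAction)
{
    // ... apply actions as before ...

    AddReward(-0.001f); // small negative existential reward per decision step

    // Only reward horizontal closeness, so staying under the ball is encouraged
    // and hitting it high is not penalized.
    float xDelta = Mathf.Abs(agentPlayer.transform.position.x - ball.transform.position.x);
    float threshold = 5f; // tune to the scale of your court
    if (xDelta < threshold)
    {
        AddReward(0.005f * (1f - xDelta / threshold));
    }

    // The +1 / -1 terminal rewards and EndEpisode for both agents are handled by the
    // environment when a point is scored or the step limit is reached.
}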

The ML-Agents repository has a lot of good working example environments. Look at one of them to see how to set up your environment correctly. Very helpful.
https://github.com/Unity-Technologies/ml-agents

I made a slime volleyball with mlagents AI back in 2019. Check it out: Slime Volleyball feat Neural AI by No Such Dev
