Hi all!
I am using ML-Agents for my final-year engineering project. I have made a very simple fighting game with two players. Each player can do nothing, move left or right, perform a heavy, medium, or light attack, or block.
The idea was to use PPO and self-play to train the agents, hoping to see the little guys develop some sort of knowledge of the moveset and what it is capable of. I have set it up so that each attack has some frame advantage on hit or block, which makes it easy to counter as long as you know the pattern.
So far I am not having a whole lot of success with the training. The two players usually manage to beat each other up for a little while before getting too scared of each other and just doing nothing. Here are the reward structures that I have tried:
- The simple win/lose reward: +1 for the agent that wins, -1 for the one that loses, and 0 for a draw, as recommended in the self-play docs.
- Progressive: if a player has 100 health, 10 damage corresponds to a 0.1 reward for the attacker and a 0.1 penalty for the player who got hit.
- Adding a small distance penalty that scales up the further apart the players are.
- Adding a small constant time penalty.
- Adding a small penalty when the player performs an attack but does not hit anything.
Here are some rewards I have thought of but have not experimented with yet:
- Adding a small penalty if an agent holds the block for too long.
- Halving the penalties, so that a penalty is only half the size of the matching reward.
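To make the progressive reward and the halved penalty concrete, here is a minimal Python sketch. The real game runs in Unity, so this is not the actual implementation: the function name `hit_rewards` is made up for illustration, and the way `damaged_reward_multiplier` and `penalty_multiplier` combine is an assumption based on the parameter names in the configs further down.

```python
def hit_rewards(damage, max_health=100.0,
                damaged_reward_multiplier=1.0, penalty_multiplier=1.0):
    """Return (attacker_reward, defender_reward) for one landed hit.

    Assumption: the reward is the fraction of max health removed, and the
    defender's penalty is that same amount scaled by penalty_multiplier.
    """
    scaled = damaged_reward_multiplier * damage / max_health
    return scaled, -penalty_multiplier * scaled


# 10 damage out of 100 health with the full penalty, then with half penalty.
print(hit_rewards(10))
print(hit_rewards(10, penalty_multiplier=0.5))
```

Note that with `penalty_multiplier` below 1 the two rewards no longer sum to zero, so under this sketch the game stops being strictly zero-sum.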
Okay, that was just some background. Now for the actual questions:
- I'm assuming the PPO variant that ML-Agents uses is PPO-Clip? I just want to confirm my understanding is right.
- In terms of self-play, is there a way in Unity to indicate which agent is currently the one that is training? I would love to put a little arrow over its head, just for interest.
- Any advice on hyperparameters? I will post some of my results with hyperparameters at the end.
- How complicated should my NN be? I might be treating my problem as more complicated than it is. I have 9 inputs: myXpos (normalized between -1 and 1), myVelocity (normalized between -1 and 1), myCurrentAction (taken from an enum, normalized between 0 and max actions), myHealth (normalized between 0 and 1), the same four for the opponent, and the distance between the players (normalized between -1 and 1). The output is just one action. The player cannot move and attack at the same time.
- How many timesteps would you say this needs to train? I have done runs of up to 50 million, just based on the PPO paper… but I'm thinking it's too many.
- With self-play, do we really expect the average cumulative reward to go up all the time? Or would we expect it to oscillate around 0, since the opponents are usually skill matched and you can expect to win 50% of the time? Or am I thinking about this wrong?
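In case it helps with the network-size question, the 9-element observation described above can be sketched in Python as follows. This is only an illustration, not the Unity code: the `Fighter` class and function names are hypothetical, the action is assumed to be scaled to 0..1 by dividing the enum index by the highest index, and the distance is assumed to be the signed x difference divided by an arena width of 2.

```python
from dataclasses import dataclass

NUM_ACTIONS = 7  # nothing, left, right, heavy, medium, light, block


@dataclass
class Fighter:
    x_pos: float     # already scaled to [-1, 1]
    velocity: float  # already scaled to [-1, 1]
    action: int      # enum index, 0 .. NUM_ACTIONS - 1
    health: float    # raw health, 0 .. max_health


def fighter_obs(f, max_health=100.0):
    """Four values per fighter: position, velocity, action, health."""
    return [f.x_pos, f.velocity,
            f.action / (NUM_ACTIONS - 1),  # assumed scaling to [0, 1]
            f.health / max_health]


def build_observation(me, opp, arena_width=2.0):
    """Own 4 values, opponent's 4 values, then the signed distance."""
    distance = (opp.x_pos - me.x_pos) / arena_width  # stays in [-1, 1]
    return fighter_obs(me) + fighter_obs(opp) + [distance]


me = Fighter(x_pos=-0.5, velocity=0.0, action=0, health=100.0)
opp = Fighter(x_pos=0.5, velocity=0.0, action=6, health=50.0)
print(len(build_observation(me, opp)))  # 9 values in total
```

One thing the sketch makes visible is that the action index is fed in as a single scalar, which imposes an ordering on the moves; a one-hot encoding would be the usual alternative, at the cost of more inputs.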
Here are some of my results:
First Run (SR0). This one used the simple win/lose reward. I'm also not too sure what is happening with my episode length… I'm thinking maybe because it's not a multiple of my summary frequency but I don't get it so if you know please let me know.

Hyperparameters (SR0):
behaviors:
  SkripsieFighter:
    trainer_type: ppo
    hyperparameters:
      batch_size: 512
      buffer_size: 10240
      learning_rate: 3e-4
      beta: 5.0e-3
      epsilon: 0.2
      lambd: 0.9
      num_epoch: 3
      learning_rate_schedule: linear
    network_settings:
      normalize: false
      hidden_units: 256
      num_layers: 2
    reward_signals:
      extrinsic:
        gamma: 0.995
        strength: 1.0
    max_steps: 50000000
    time_horizon: 2048
    summary_freq: 20000
    self_play:
      save_steps: 20000
      team_change: 100000
      swap_steps: 40000
      window: 15
      play_against_latest_model_ratio: 0.4
environment_parameters:
  max_env_steps: 10000.0
  win_reward: 1.0
  damaged_reward_multiplier: 0.0
  attack_missed_reward: 0.0
  distance_reward: 0.0
  time_reward: 0.0
  block_hold_reward: 0.0
  max_block_time: 0.0
  penalty_multiplier: 1.0
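As a rough sanity check on the SR0 numbers above, the following Python sketch works out how many minibatches and policy updates they imply. It assumes one policy update each time `buffer_size` experiences have been collected and that every step counts toward `max_steps`; with self-play the real count may differ, so treat the result as an upper-bound estimate. The function name `ppo_update_counts` is made up for illustration.

```python
def ppo_update_counts(buffer_size, batch_size, num_epoch, max_steps):
    """Return (minibatches per epoch, gradient steps per update, updates)."""
    minibatches_per_epoch = buffer_size // batch_size
    gradient_steps_per_update = minibatches_per_epoch * num_epoch
    # Assumption: one policy update per full buffer of experiences.
    policy_updates = max_steps // buffer_size
    return minibatches_per_epoch, gradient_steps_per_update, policy_updates


# SR0 values: batch_size 512, buffer_size 10240, num_epoch 3, max_steps 50M.
print(ppo_update_counts(10240, 512, 3, 50_000_000))
```

For SR0 this gives 20 minibatches per epoch, 60 gradient steps per update, and about 4882 updates over the full run.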
Second Run (SR1). This one experimented with halving the penalty, as it felt like the agents were so scared of getting hit that they would not try to hit each other.
Hyperparameters (SR1):
behaviors:
  SkripsieFighter:
    trainer_type: ppo
    hyperparameters:
      batch_size: 512
      buffer_size: 10240
      learning_rate: 3e-4
      beta: 5.0e-3
      epsilon: 0.2
      lambd: 0.9
      num_epoch: 3
      learning_rate_schedule: linear
    network_settings:
      normalize: false
      hidden_units: 128
      num_layers: 2
    reward_signals:
      extrinsic:
        gamma: 0.995
        strength: 1.0
    max_steps: 50000000
    time_horizon: 1024
    summary_freq: 50000
    self_play:
      save_steps: 20000
      team_change: 100000
      swap_steps: 40000
      window: 5
      play_against_latest_model_ratio: 0.4
environment_parameters:
  max_env_steps: 5000.0
  win_reward: 1.0
  damaged_reward_multiplier: 0.0
  attack_missed_reward: 0.0
  distance_reward: 0.0
  time_reward: 0.0
  block_hold_reward: 0.0
  max_block_time: 0.0
  penalty_multiplier: 0.5
Run (SR2): With this one I tried the progressive reward with half penalty.
env_settings:
  env_path: FinalBuild2/SKRIPSIE
  num_envs: 1
engine_settings:
  width: 960
  height: 540
  time_scale: 5
torch_settings:
  device: cpu
behaviors:
  SkripsieFighter:
    trainer_type: ppo
    hyperparameters:
      batch_size: 512
      buffer_size: 10240
      learning_rate: 3e-4
      beta: 5.0e-3
      epsilon: 0.2
      lambd: 0.9
      num_epoch: 3
      learning_rate_schedule: linear
    network_settings:
      normalize: false
      hidden_units: 128
      num_layers: 2
    reward_signals:
      extrinsic:
        gamma: 0.995
        strength: 1.0
    max_steps: 50000000
    time_horizon: 1024
    summary_freq: 50000
    self_play:
      save_steps: 20000
      team_change: 100000
      swap_steps: 40000
      window: 5
      play_against_latest_model_ratio: 0.4
environment_parameters:
  max_env_steps: 5000.0
  win_reward: 0.0
  damaged_reward_multiplier: 1.0
  attack_missed_reward: 0.0
  distance_reward: 0.0
  time_reward: 0.0
  block_hold_reward: 0.0
  max_block_time: 0.0
  penalty_multiplier: 0.5
Some more runs and tears T.T between these.
My latest runs seem a little more promising. After digging through the docs some more, I came across the curiosity and memory features and tried those.
Run (SR8): This run is still running on my computer as I write this. This one uses progressive rewards with memory added.
Hyperparameters (SR8):
behaviors:
  SkripsieFighter:
    trainer_type: ppo
    hyperparameters:
      batch_size: 512
      buffer_size: 10240
      learning_rate: 1.0e-4
      beta: 1.0e-2
      epsilon: 0.2
      lambd: 0.9
      num_epoch: 3
      learning_rate_schedule: linear
    network_settings:
      normalize: false
      hidden_units: 512
      num_layers: 2
      memory:
        memory_size: 128
        sequence_length: 64
    reward_signals:
      extrinsic:
        gamma: 1.0
        strength: 1.0
    max_steps: 50000000
    time_horizon: 1024
    summary_freq: 50000
    self_play:
      save_steps: 100000
      team_change: 500000
      swap_steps: 50000
      window: 30
      play_against_latest_model_ratio: 0.5
environment_parameters:
  max_env_steps: 10000.0
  win_reward: 0.0
  damaged_reward_multiplier: 1.0
  attack_missed_reward: 0.0
  distance_reward: 0.0
  time_reward: 0.0
  block_hold_reward: 0.0
  max_block_time: 0.0
  penalty_multiplier: 1.0
So SR8 is my most promising run so far, but it's still not great. My agents still really like hiding on opposite sides of the map XD. The wins are not actually zero… I just resumed the run.
If anybody is keen to chat about this with me, or can enlighten me as to what I might be doing wrong, that would be absolutely wonderful! Thank you so much for your time, and let me know if you need any other information from me.



