Hey guys. I am training ML-Agents for my multiplayer racing game, but they aren't able to learn, despite me trying over 10 different strategies to train them.
Game Description:
It is a multiplayer racing game where you control cars (shaped like balls/marbles). The cars auto-accelerate and have a max speed. You only have two controls:
- Brake: hold the button to brake and slow the car down
- Boost: faster acceleration to a higher max speed (stays for 2 sec, then refreshes after 10 seconds)
The track has walls which contain the cars, but it is possible to fall off the track if you are too fast or hit a wall too hard, for example by not slowing down on a turn. If you fall, you are respawned within 1 s at the last checkpoint.
The track has densely packed checkpoints placed all over it, which are simple Unity box colliders used as triggers.
ML-Agents Setup:
- Goal
Finish the track as fast as possible and avoid falling off it. One episode is one lap around the track or a maximum of 2500 steps; normally it should take fewer than 2000 steps to finish a lap.
- Actions
A single discrete action branch with a size of three: Brake, Boost, Do nothing.
If you boost, the car enters the boost state and subsequent boost inputs do nothing until the boost refreshes again. The car automatically exits the boost state after 1.5 seconds, then the boost refreshes after 10 seconds.
Brake: as long as brake input is received, the car stays in the brake state, and leaves it when the input changes. Brake input also cancels the boost state.
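For reference, a minimal sketch of how that single discrete branch could be mapped onto the car; the TryBoost() helper is a hypothetical placeholder for the game-specific boost logic, not the actual project code:

using Unity.MLAgents;
using Unity.MLAgents.Actuators;

public class RacerAgent : Agent
{
    bool braking;   // brake is held only while the Brake action keeps arriving; forwarded to the car physics elsewhere

    public override void OnActionReceived(ActionBuffers actions)
    {
        int action = actions.DiscreteActions[0];   // branch 0, size 3: 0 = do nothing, 1 = brake, 2 = boost
        braking = action == 1;                     // choosing another action leaves the brake state
        if (action == 2) TryBoost();               // ignored by the game until the boost has refreshed
    }

    void TryBoost()
    {
        // hypothetical hook: enter the boost state, start the 1.5 s boost and 10 s refresh timers, etc.
    }
}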
- Rewards and Punishments
3.1 -1 for falling off the track
3.2 +0.5 for passing through a turn without falling
3.3 -0.01 for spamming actions, e.g. braking too often (a brake within 0.2 ms of the previous brake). You should hold brake to slow down rather than spam it.
3.4 -0.5 if you apply boost and cancel it with brake within 1 second (to encourage boosting at proper sections of the track)
3.5 -0.001 on each step, to encourage finishing the track faster
3.6 +0.1 * (normalized squared velocity) on passing through each checkpoint (the faster the speed, the more the reward)
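A minimal sketch of how the main rewards above could be wired up, assuming the checkpoints are tagged "Checkpoint" and the respawn logic calls a hook on the agent when the car falls; the class name, the hook, and the maxSpeed value are assumptions, not the actual project code:

using Unity.MLAgents;
using Unity.MLAgents.Actuators;
using UnityEngine;

public class RacerRewards : Agent
{
    public float maxSpeed = 30f;                   // assumed top speed, used to normalize velocity
    Rigidbody rb;

    public override void Initialize() { rb = GetComponent<Rigidbody>(); }

    public override void OnActionReceived(ActionBuffers actions)
    {
        AddReward(-0.001f);                        // 3.5: small per-decision penalty to encourage finishing faster
    }

    void OnTriggerEnter(Collider other)
    {
        if (!other.CompareTag("Checkpoint")) return;                 // checkpoints are trigger box colliders
        float v = Mathf.Clamp01(rb.velocity.magnitude / maxSpeed);
        AddReward(0.1f * v * v);                                     // 3.6: faster through the checkpoint, bigger reward
    }

    public void OnFellOffTrack()                   // assumed hook, called by the respawn logic
    {
        AddReward(-1f);                            // 3.1: penalty for falling off the track
    }
}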
- Inputs
4.1 Normalized velocity
4.2 Boost cancel penalty active (i.e. if the agent cancels the current boost state it will be punished, since it applied boost less than 1 second ago)
4.3 Spam boost penalty active
4.4 Spam brake penalty active
4.5 Car state (Boost, Brake, or Running) (one-hot encoded)
4.6 Incoming turn difficulty (Easy, Medium, Hard) (one-hot encoded)
4.7 Incoming turn direction (Left, Right) (one-hot encoded)
4.8 Incoming turn distance (normalized between 0 and 1 if the distance is less than 10, otherwise just 1)
4.9 Rays to judge position on the track (distance from the left and right walls)
4.10 Rays to see the incoming turn (only three rays, 1 degree apart)
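A sketch of how the vector part of these observations could be collected (the two ray observations would come from separate Ray Perception Sensor 3D components rather than from this method); all field names and the 30 m/s top speed are assumptions:

using Unity.MLAgents;
using Unity.MLAgents.Sensors;
using UnityEngine;

public class RacerObservations : Agent
{
    public float maxSpeed = 30f;        // assumed top speed for normalization
    Rigidbody rb;

    // hypothetical state kept up to date elsewhere in the agent
    bool boostCancelPenaltyActive, spamBoostPenaltyActive, spamBrakePenaltyActive;
    int carState;                       // 0 = Running, 1 = Brake, 2 = Boost
    int turnDifficulty;                 // 0 = Easy, 1 = Medium, 2 = Hard
    int turnDirection;                  // 0 = Left, 1 = Right
    float turnDistance;                 // distance to the next turn

    public override void Initialize() { rb = GetComponent<Rigidbody>(); }

    public override void CollectObservations(VectorSensor sensor)
    {
        sensor.AddObservation(Mathf.Clamp01(rb.velocity.magnitude / maxSpeed)); // 4.1
        sensor.AddObservation(boostCancelPenaltyActive);                        // 4.2
        sensor.AddObservation(spamBoostPenaltyActive);                          // 4.3
        sensor.AddObservation(spamBrakePenaltyActive);                          // 4.4
        sensor.AddOneHotObservation(carState, 3);                               // 4.5
        sensor.AddOneHotObservation(turnDifficulty, 3);                         // 4.6
        sensor.AddOneHotObservation(turnDirection, 2);                          // 4.7
        sensor.AddObservation(Mathf.Clamp01(turnDistance / 10f));               // 4.8
    }
}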
Training Configs
behaviors:
  Race:
    trainer_type: ppo
    hyperparameters:
      batch_size: 512
      buffer_size: 102400
      learning_rate: 3.0e-4
      beta: 5.0e-4
      epsilon: 0.2
      lambd: 0.99
      num_epoch: 5
      learning_rate_schedule: linear
    network_settings:
      normalize: false
      hidden_units: 128
      num_layers: 2
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
      gail:
        gamma: 0.99
        strength: 0.05
        demo_path: Assets/ML-Agents/Demos/LegoV7.demo
    behavioral_cloning:
      strength: 0.5
      demo_path: Assets/ML-Agents/Demos/LegoV7.demo
    max_steps: 4000000
    time_horizon: 64
    summary_freq: 50000
    threaded: true
Screenshots
Problem
I have trained bots with tons of different configurations, with 1 to 50 stacked observations, and with and without a demo (12,000 steps of demo).
But my bots are not learning how to do a lap. They do not brake when a turn is coming, even though I have rewards and punishments set up for passing or falling off a turn. They also fail to learn the best places to boost, and boost at very inconsistent locations; it almost seems they boost as soon as the boost is available.
I train the bots on a very simple track first, for 2 million steps, and then on a slightly more difficult track for 4 million steps.
I would absolutely love it if someone could offer feedback on what I am doing wrong and how I can get the desired behavior. It is a simple enough task, yet I have been banging my head against it for a month now with barely any progress.
What I can tell is that you are overcomplicating things; as it stands, it will never work.
- add invisible walls and set a very small negative reward on stay. You can remove them once trained, or just disable the rigidbody once it's doing all right.
- remove all rewards. Keep just the reward at each checkpoint (see the sketch below).
- add a very small negative "existential" reward. That should be enough to motivate it to go fast.
- observations: all you need is front ray cast sensors. Detectable tags are checkpoints and walls (and maybe other cars if you have collisions).
- actions: 2 discrete branches, one of size 3 like you have and one of size 3 for left/right/straight (assuming the car can turn; I don't think that's the case in your game, right?).
That’s about it
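For what it's worth, a minimal sketch of that stripped-down scheme: only a checkpoint reward plus a tiny existential penalty. The "Checkpoint" tag and the use of the agent's MaxStep field are assumptions:

using Unity.MLAgents;
using Unity.MLAgents.Actuators;
using UnityEngine;

public class SimpleRacerAgent : Agent
{
    public override void OnActionReceived(ActionBuffers actions)
    {
        // ... apply brake/boost from actions.DiscreteActions[0] here ...
        AddReward(-1f / MaxStep);              // tiny existential penalty (assumes Max Step is set, e.g. 2500)
    }

    void OnTriggerEnter(Collider other)
    {
        if (other.CompareTag("Checkpoint"))    // assumed tag on the checkpoint triggers
            AddReward(1f);                     // the only positive signal: reaching checkpoints
    }
}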
Thank you for the feedback.
- In my game, the cars can't steer themselves; they just go around the track and the curves of the track point them in the right direction. However, if you approach a turn too fast, instead of turning, you can fall off it. It's all physics based.
- Keeping this in mind, how would I encourage them to brake and slow down when a turn is approaching? Even if I add invisible walls and give a negative reward on stay, they will keep getting the negative reward when they are on a straight stretch of the track but just hugging a wall. There isn't anything wrong with that, and the car should not get a negative reward there; only falling off the track when they fail to slow down on a turn is what I want to discourage. Do you have an idea how the environment or rewards should be designed to help achieve this result?
I am attaching a quick snippet of gameplay to explain it better.

Maybe add the walls with negative rewards only in the turns (the main reason for the walls is so it can continue learning and isn't stopped/reset too often).
It should learn to slow down by itself. It might take some time, but it will get there on its own.
It's important not to lead it too much, but instead to let it figure things out by itself.
Maybe you could add a velocity observation in addition to a ray sensor, and put a "Turn" tag on the turn walls. It should be able to sort things out and brake when it gets close to a turn.
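Something like this could implement the turn-only walls; the "TurnWall" tag, the component layout, and the penalty size are all assumptions, just a sketch of the idea:

using Unity.MLAgents;
using UnityEngine;

public class TurnWallPenalty : MonoBehaviour
{
    public Agent agent;                                  // the car's agent component

    void OnCollisionStay(Collision collision)
    {
        if (collision.collider.CompareTag("TurnWall"))   // invisible walls placed only in the turns
            agent.AddReward(-0.005f);                    // small penalty for every physics step spent against a turn wall
    }
}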
Thanks for the advice, I will try this out
Welcome to ML! My attempt to replicate a simple 1D PID controller took 2 hours to train, despite being the simplest thing that exists, and it still worked worse than a coarsely tuned PID controller.
PPO can work amazingly on problems that are not solvable otherwise, like walking with joints, but it really struggles to do anything smart.
- activate normalization in case you messed up your own normalization; otherwise the input will be clipped and not usable
- don't give negative rewards for braking and boosting: you want them to decide on this; right now they just think braking is bad. You can add this behaviour shaping later; for now you just want them to do anything smart
- What is the goal? Driving fast through the finish line? No! Braking and boosting the way you think it should work? No!
→ The goal is to get to the next checkpoint as fast as possible, which is AddReward(1 - timeAtArrival/MaxTime), nothing more and nothing less (see the sketch after this list). Having too big a punishment for falling makes them do nothing, because it's safer to be slow.
- Let it run for at least 2 hours / 2 million steps and see what happens.
You know this already, I assume, but --no-graphics and up to 32 envs really speed up training. You could try SAC, but in my experience it never works.
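A rough sketch of that time-to-checkpoint reward; the per-checkpoint time budget (maxTimePerCheckpoint) and the "Checkpoint" tag are assumptions, not a definitive implementation:

using Unity.MLAgents;
using UnityEngine;

public class TimedCheckpointAgent : Agent
{
    public float maxTimePerCheckpoint = 10f;   // assumed time budget to reach the next checkpoint
    float timer;

    public override void OnEpisodeBegin() { timer = 0f; }

    void FixedUpdate() { timer += Time.fixedDeltaTime; }

    void OnTriggerEnter(Collider other)
    {
        if (!other.CompareTag("Checkpoint")) return;
        AddReward(Mathf.Clamp01(1f - timer / maxTimePerCheckpoint)); // faster arrival, bigger reward
        timer = 0f;                                                  // restart the clock for the next checkpoint
    }
}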
Cheers!
One more thing: start by training it on a straight line, then create a course with 4 left turns and train it to turn left. Once it's good, turn it around and let it turn right. Once it's good at that, it should be able to turn both left and right.
--no-graphics only works if there are no raycast sensors. Personally, I prefer using raycasts and giving up on no-graphics; it makes observations so much easier.
I am currently working on another thing; I will come back to this training task soon and take all the feedback I got to retrain. Will share more updates soon. Thanks a lot for taking the time to read my case and offer advice.
Prove me wrong please, but I'm pretty sure the Raycast sensor works with physics raycasts (colliders), so on the CPU. The only thing that blocks --no-graphics, and that I painfully experienced, is that 2D CNNs (CameraSensor) of course don't work.
I can prove you right
Just re-tested, and yes, it works; it's only the camera sensor that needs the UI.
Thanks for testing! Weirdly, Ray Cast Sensors really improve performance. I thought that in a static environment like my soccer game, position, rotation, etc. would be sufficient (overfitting to the map), but with raycasts it worked a lot better! Camera Sensors and the grid sensor that is based on them destroy performance, though.
Sorry for hijacking / necroing, but do you know if grid sensors work with --no-graphics? It shouldn't work (according to my limited knowledge), right? Thanks (and again, sorry for hijacking the post).
Hey, I am pretty sure it does not work. It is implemented (by some random guy, and Unity took it in) based on a CNN, which only works when graphics are on (it is used with a camera object).