Training simulation stopped after 7 hours of training for unknown reasons

Hello Everyone!

I am using ML-Agents version 1.6.0 and Unity version 2019.4.14f1, and my training stopped automatically. A screenshot of the problem is as follows:

I don’t know why the training stopped. I would be glad if someone could help me with this issue.

Thanks!

Some more screenshots to trace the problem…

My hyperparameters:
behaviors:
  CarBehavior:
    trainer_type: ppo
    hyperparameters:
      batch_size: 1024
      buffer_size: 10240
      learning_rate: 0.0003
      beta: 0.005
      epsilon: 0.2
      lambd: 0.95
      num_epoch: 3
      learning_rate_schedule: linear
    network_settings:
      normalize: true
      hidden_units: 64
      num_layers: 2
      vis_encode_type: simple
    reward_signals:
      extrinsic:
        gamma: 0.98
        strength: 1.0
    keep_checkpoints: 5
    checkpoint_interval: 500000
    max_steps: 5000000
    time_horizon: 64
    summary_freq: 50000
    threaded: true
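
Since forum formatting tends to strip YAML indentation, here is a minimal sanity check (not part of my setup) that the configuration above still parses into the nested structure that mlagents-learn expects. It assumes the config is saved as car_config.yaml, which is a hypothetical file name:

# Minimal sketch: verify the trainer config parses into the expected nesting.
# Assumes the YAML above is saved as "car_config.yaml" (hypothetical name).
# Requires "pip install pyyaml".
import yaml

with open("car_config.yaml") as f:
    config = yaml.safe_load(f)

car = config["behaviors"]["CarBehavior"]
print("trainer:", car["trainer_type"])
print("batch/buffer:", car["hyperparameters"]["batch_size"], car["hyperparameters"]["buffer_size"])

# With checkpoint_interval = 500000 and max_steps = 5000000, a checkpoint is
# written roughly every 500k steps, so an interrupted run can usually be
# resumed instead of restarted from scratch.
print("checkpoints over the run:", car["max_steps"] // car["checkpoint_interval"])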

Continuous actions:
Accelerate, mapped to values (0 to 1)
Brake, mapped to values (-1 to 0)

Goal:
maintain a distance from the car next to the agent

Observation:
distance to the next car
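
For reference, a quick way to confirm that the built environment really exposes this layout (one distance observation, two continuous actions) is to step it once with the mlagents_envs low-level Python API. This is only a sketch, not the training setup itself: "CarEnv" is a hypothetical build name, and it assumes a Python package recent enough (0.22+) to use ActionTuple; field names differ slightly in older releases.

# Minimal sketch: inspect the behavior spec and send one arbitrary action.
import numpy as np
from mlagents_envs.environment import UnityEnvironment
from mlagents_envs.base_env import ActionTuple

env = UnityEnvironment(file_name="CarEnv", no_graphics=True)  # hypothetical build name
env.reset()

behavior_name = list(env.behavior_specs)[0]
spec = env.behavior_specs[behavior_name]
print("continuous actions:", spec.action_spec.continuous_size)  # 2 if accelerate and brake are separate
print("observation shapes:", spec.observation_shapes)           # expect one shape of size 1 (the distance)

decision_steps, _ = env.get_steps(behavior_name)
print("distance observation:", decision_steps.obs[0][0])

# Send one arbitrary action pair per agent: accelerate 0.5, brake 0.0.
action = ActionTuple(continuous=np.array([[0.5, 0.0]] * len(decision_steps), dtype=np.float32))
env.set_actions(behavior_name, action)
env.step()
env.close()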

Hi @ammad99,
Did you see anything in the Unity Editor or Player log? Did your environment crash? Could you provide more info for us, please?
Cheers,
Chris

Hi @christophergoy, many thanks for your message. Please see the following screenshots, which may help in tracing the problem.

In my Console window I got this:

Although I stopped the simulation, my Task Manager still shows these numbers for memory:



What I observed was that my environment got stuck in the middle of the simulation: my agent was at its mean position, whereas the moving car was stuck somewhere in the middle of the road (not at its mean position).

Also, I should mention that I am trying to train only one agent…
At the moment my Unity Editor is stuck, so I have to close it forcefully.

Previously I was using ml-agents 0.21 (the TensorFlow Python package) together with the ML-Agents 1.6 Unity package, but then I changed the Unity package to 1.5 because I thought maybe 1.6 was not compatible with ml-agents 0.21. The problem is still there :(…

Any help in this regard would be highly appreciated, because I don’t know what the problem is.

Maybe it’s a memory issue, but I don’t know how to solve it :(

Digging more into the problem, I found that when the training gets stuck, pressing the “Esc” key helps before it shows you the error that the environment took too long to respond…
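
For reference, the “environment took too long to respond” message is a timeout on the Python side. If the environment is driven through the low-level mlagents_envs API (rather than mlagents-learn), the wait time is configurable and the hang surfaces as an exception that can be caught; a minimal sketch, with “CarEnv” again a hypothetical build name:

from mlagents_envs.environment import UnityEnvironment
from mlagents_envs.exception import UnityTimeOutException

try:
    # timeout_wait (seconds) is how long the Python side waits for Unity to
    # respond before giving up with the "took too long" error.
    env = UnityEnvironment(file_name="CarEnv", timeout_wait=120)
    env.reset()
except UnityTimeOutException:
    # Same condition as the error above; it usually means the Unity process
    # hung or ran out of memory rather than a problem in the trainer itself.
    print("Unity environment did not respond in time")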

My 2 cents:
32 GB of RAM should be enough to run almost every homemade sim. Looking at your memory graph, it seems like your code is accumulating data without ever discarding it, leading to the “out of memory” message.
Maybe a list became too big? Huge arrays stored in multiple copies? GameObjects being deactivated instead of destroyed, accumulating over time?
I suggest you check whether everything is properly initialized between episodes.
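
One way to confirm that suspicion is to log the resident memory of the Unity and trainer processes while training runs; if it climbs steadily, something is accumulating per step or per episode. A minimal sketch (not part of ML-Agents, uses the third-party psutil package):

# Minimal sketch: log resident memory of Unity/Python processes once a minute.
# Requires "pip install psutil"; process names may differ on your machine.
import time
import psutil

def log_memory(name_fragments=("unity", "python"), interval_s=60):
    while True:
        for proc in psutil.process_iter(["name", "memory_info"]):
            name = (proc.info["name"] or "").lower()
            mem = proc.info["memory_info"]
            if mem is None or not any(frag in name for frag in name_fragments):
                continue
            print(f"{time.strftime('%H:%M:%S')} {name} (pid {proc.pid}): {mem.rss / 1024 / 1024:.0f} MB")
        time.sleep(interval_s)

if __name__ == "__main__":
    log_memory()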