Wall Jump Example Stops Training Abruptly

Description
When I run training for the Wall Jump Example in the ml-agents-release1 folder,

mlagents-learn config/trainer_config.yaml --run-id=WallJump2 --force

and press the Play button, training starts as usual, but everything stops after about 30 seconds. The agent floats in midair, the Unity window stops responding, and the Command Prompt prints no further output. During this time a Python process uses about 40% CPU. Pressing Ctrl-C in the Command Prompt unfreezes the Unity window, but the Python process keeps running (still at about 40% CPU), and I have to end it from Task Manager. The last line of the CMD output below appeared only after I pressed Ctrl-C.
Any idea what could be going wrong? I am using ML-Agents Release 1 as downloaded from the GitHub page.

Versions
Unity: 2019.3.13f1
Python: 3.7.7
ml-agents: 0.16.0
ml-agents-envs: 0.16.0
Communicator API: 1.0.0
TensorFlow: 2.1.0

CMD output
mlagents-learn config/trainer_config.yaml --run-id=WallJump2 --force
2020-05-30 21:29:10.318307: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
WARNING:tensorflow:From C:\Users\nihal\anaconda3\envs\tf_gpu\lib\site-packages\tensorflow_core\python\compat\v2_compat.py:88: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
Version information:
ml-agents: 0.16.0,
ml-agents-envs: 0.16.0,
Communicator API: 1.0.0,
TensorFlow: 2.1.0
2020-05-30 21:29:13.318042: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
WARNING:tensorflow:From C:\Users\nihal\anaconda3\envs\tf_gpu\lib\site-packages\tensorflow_core\python\compat\v2_compat.py:88: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
2020-05-30 21:29:15 INFO [environment.py:201] Listening on port 5004. Start training by pressing the Play button in the Unity Editor.
2020-05-30 21:29:20 INFO [environment.py:111] Connected to Unity environment with package version 1.0.0-preview and communication version 1.0.0
2020-05-30 21:29:20 INFO [environment.py:342] Connected new brain:
SmallWallJump?team=0
2020-05-30 21:29:20.729678: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
2020-05-30 21:29:20.739893: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library nvcuda.dll
2020-05-30 21:29:20.776772: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce GTX 1660 Ti computeCapability: 7.5
coreClock: 1.59GHz coreCount: 24 deviceMemorySize: 6.00GiB deviceMemoryBandwidth: 268.26GiB/s
2020-05-30 21:29:20.786215: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020-05-30 21:29:20.795884: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-05-30 21:29:20.805575: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_10.dll
2020-05-30 21:29:20.812470: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_10.dll
2020-05-30 21:29:20.825955: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_10.dll
2020-05-30 21:29:20.834143: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_10.dll
2020-05-30 21:29:20.846617: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-05-30 21:29:20.853374: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-05-30 21:29:21.471889: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-05-30 21:29:21.477498: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] 0
2020-05-30 21:29:21.481310: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0: N
2020-05-30 21:29:21.485303: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4625 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1660 Ti, pci bus id: 0000:01:00.0, compute capability: 7.5)
2020-05-30 21:29:21 WARNING [stats.py:197] events.out.tfevents.1590851932.LAPTOP-3CHHMIT0 was left over from a previous run. Deleting.
2020-05-30 21:29:21 WARNING [stats.py:197] events.out.tfevents.1590851971.LAPTOP-3CHHMIT0 was left over from a previous run. Deleting.
2020-05-30 21:29:21 INFO [stats.py:130] Hyperparameters for behavior name WallJump2_SmallWallJump:
trainer: ppo
batch_size: 128
beta: 0.005
buffer_size: 2048
epsilon: 0.2
hidden_units: 256
lambd: 0.95
learning_rate: 0.0003
learning_rate_schedule: linear
max_steps: 5e6
memory_size: 128
normalize: False
num_epoch: 3
num_layers: 2
time_horizon: 128
sequence_length: 64
summary_freq: 20000
use_recurrent: False
vis_encode_type: simple
reward_signals:
extrinsic:
strength: 1.0
gamma: 0.99
summary_path: WallJump2_SmallWallJump
model_path: ./models/WallJump2/SmallWallJump
keep_checkpoints: 5
2020-05-30 21:29:21.522400: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce GTX 1660 Ti computeCapability: 7.5
coreClock: 1.59GHz coreCount: 24 deviceMemorySize: 6.00GiB deviceMemoryBandwidth: 268.26GiB/s
2020-05-30 21:29:21.533581: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020-05-30 21:29:21.538149: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-05-30 21:29:21.542434: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_10.dll
2020-05-30 21:29:21.547138: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_10.dll
2020-05-30 21:29:21.551515: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_10.dll
2020-05-30 21:29:21.556379: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_10.dll
2020-05-30 21:29:21.561277: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-05-30 21:29:21.566907: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-05-30 21:29:21.569910: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-05-30 21:29:21.574760: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] 0
2020-05-30 21:29:21.577636: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0: N
2020-05-30 21:29:21.580530: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4625 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1660 Ti, pci bus id: 0000:01:00.0, compute capability: 7.5)
2020-05-30 21:29:23.056474: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-05-30 21:29:23 INFO [environment.py:342] Connected new brain:
BigWallJump?team=0
2020-05-30 21:29:23 WARNING [env_manager.py:109] Agent manager was not created for behavior id BigWallJump?team=0.
2020-05-30 21:29:23.572885: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce GTX 1660 Ti computeCapability: 7.5
coreClock: 1.59GHz coreCount: 24 deviceMemorySize: 6.00GiB deviceMemoryBandwidth: 268.26GiB/s
2020-05-30 21:29:23.582329: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020-05-30 21:29:23.587127: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-05-30 21:29:23.591483: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_10.dll
2020-05-30 21:29:23.596612: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_10.dll
2020-05-30 21:29:23.601308: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_10.dll
2020-05-30 21:29:23.606158: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_10.dll
2020-05-30 21:29:23.610582: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-05-30 21:29:23.616124: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-05-30 21:29:23.619079: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-05-30 21:29:23.623984: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] 0
2020-05-30 21:29:23.626724: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0: N
2020-05-30 21:29:23.630220: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4625 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1660 Ti, pci bus id: 0000:01:00.0, compute capability: 7.5)
2020-05-30 21:29:23 WARNING [stats.py:197] events.out.tfevents.1590851934.LAPTOP-3CHHMIT0 was left over from a previous run. Deleting.
2020-05-30 21:29:23 WARNING [stats.py:197] events.out.tfevents.1590851973.LAPTOP-3CHHMIT0 was left over from a previous run. Deleting.
2020-05-30 21:29:23 INFO [stats.py:130] Hyperparameters for behavior name WallJump2_BigWallJump:
trainer: ppo
batch_size: 128
beta: 0.005
buffer_size: 2048
epsilon: 0.2
hidden_units: 256
lambd: 0.95
learning_rate: 0.0003
learning_rate_schedule: linear
max_steps: 2e7
memory_size: 128
normalize: False
num_epoch: 3
num_layers: 2
time_horizon: 128
sequence_length: 64
summary_freq: 20000
use_recurrent: False
vis_encode_type: simple
reward_signals:
extrinsic:
strength: 1.0
gamma: 0.99
summary_path: WallJump2_BigWallJump
model_path: ./models/WallJump2/BigWallJump
keep_checkpoints: 5
2020-05-30 21:29:23.658786: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce GTX 1660 Ti computeCapability: 7.5
coreClock: 1.59GHz coreCount: 24 deviceMemorySize: 6.00GiB deviceMemoryBandwidth: 268.26GiB/s
2020-05-30 21:29:23.668717: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020-05-30 21:29:23.672988: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-05-30 21:29:23.678131: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_10.dll
2020-05-30 21:29:23.682779: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_10.dll
2020-05-30 21:29:23.687796: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_10.dll
2020-05-30 21:29:23.692050: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_10.dll
2020-05-30 21:29:23.697388: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-05-30 21:29:23.702211: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-05-30 21:29:23.705757: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-05-30 21:29:23.710373: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] 0
2020-05-30 21:29:23.713525: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0: N
2020-05-30 21:29:23.717303: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4625 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1660 Ti, pci bus id: 0000:01:00.0, compute capability: 7.5)
2020-05-30 21:30:12 INFO [subprocess_env_manager.py:191] UnityEnvironment worker 0: environment stopping.
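
For anyone comparing a log like this against their installed CUDA and cuDNN versions, it can help to pull out the library names TensorFlow reports loading. The following is only a rough sketch: the function name `loaded_libraries` is made up for illustration, and it assumes the log has been saved as plain text containing the "Successfully opened dynamic library" lines shown above.

```python
import re

# Matches TensorFlow's "Successfully opened dynamic library <name>" log lines.
_LIB_PATTERN = re.compile(r"Successfully opened dynamic library (\S+)")


def loaded_libraries(log_text):
    """Return the distinct library names from the log, in first-seen order.

    Hypothetical helper for illustration; not part of ML-Agents or TensorFlow.
    """
    seen = []
    for match in _LIB_PATTERN.finditer(log_text):
        name = match.group(1)
        if name not in seen:
            seen.append(name)
    return seen


if __name__ == "__main__":
    sample = (
        "2020-05-30 21:29:20.786215: I tensorflow/stream_executor/platform/"
        "default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll\n"
        "2020-05-30 21:29:20.846617: I tensorflow/stream_executor/platform/"
        "default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll\n"
    )
    for library in loaded_libraries(sample):
        print(library)
```

The file names hint at the versions in use (for example, `cudart64_101.dll` suggests CUDA 10.1 and `cudnn64_7.dll` suggests cuDNN 7), which can then be checked against what is actually installed.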

I'll flag this for the team to take a look.


I somehow managed to solve this issue by reinstalling everything. Previously, I had installed CUDA, cuDNN and tensorflow-gpu through

conda install tensorflow-gpu

which automatically pulls in matching versions of TF, CUDA and cuDNN. This setup has worked well for my other deep learning code (such as MNIST digit recognition).
This time, I first uninstalled Anaconda (which removed all conda-installed packages, including CUDA and cuDNN). Then I installed CUDA and cuDNN manually, following the Nvidia website. Then I ran

conda install tensorflow-gpu

which again downloads CUDA and cuDNN, for reasons I don't know, but the versions are exactly the same as in my manual CUDA and cuDNN install. After installing the correct mlagents Python package and Unity package, it finally worked.
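
After a reinstall like this, a quick sanity check of the Python side can confirm which package versions are active and whether TensorFlow sees the GPU. This is a minimal sketch, not an official ML-Agents tool: the helper names are invented for illustration, and it assumes the modules `mlagents.trainers` and `mlagents_envs` expose a `__version__` attribute (it prints "unknown" if they do not). `tf.config.list_physical_devices` is available in TensorFlow 2.1 and later.

```python
import importlib
import sys


def package_version(module_name):
    """Return the module's __version__, "unknown" if absent, or None if not importable."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, "__version__", "unknown")


def environment_report(module_names=("tensorflow", "mlagents.trainers", "mlagents_envs")):
    """Collect the Python version plus the version of each named module."""
    report = {"python": sys.version.split()[0]}
    for name in module_names:
        report[name] = package_version(name)
    return report


def visible_gpus():
    """List the GPUs TensorFlow can see; returns an empty list if TensorFlow is missing."""
    try:
        import tensorflow as tf
    except ImportError:
        return []
    return [device.name for device in tf.config.list_physical_devices("GPU")]


if __name__ == "__main__":
    for name, version in environment_report().items():
        print(f"{name}: {version or 'not installed'}")
    print("GPUs visible to TensorFlow:", visible_gpus() or "none")
```

Run it inside the same conda environment that `mlagents-learn` uses, so the reported versions match what training actually loads.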


Thanks for the update! Happy to hear you got it working!
