Buffer Size Parameter - Clarification

Hey!

I was just wondering whether I understood the buffer_size parameter correctly. The documentation confused me a bit.

Documentation

(default = 10240 for PPO and 50000 for SAC) Number of experiences to collect before updating the policy model. Corresponds to how many experiences should be collected before we do any learning or updating of the model. This should be multiple times larger than batch_size. Typically a larger buffer_size corresponds to more stable training updates. In SAC, the max size of the experience buffer - on the order of thousands of times longer than your episodes, so that SAC can learn from old as well as new experiences.
Typical range: PPO: 2048 - 409600; SAC: 50000 - 1000000

My understanding is the following:
PPO: Policy updates occur every time the number of experiences defined by this parameter has been collected. If we set this value to 3k, it will take 3k agent steps to collect 3k experiences. Those experiences are then split into batches of our defined batch_size and fed into the neural network to perform the weight updates. This then happens every 3k agent steps.
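To make sure I'm describing the same thing, here is a tiny sketch of the loop I have in mind. This is plain Python, not ML-Agents code; collect_experience and update_policy are placeholder stubs I made up purely for illustration.

import random

buffer_size = 3000   # experiences to collect before each policy update
batch_size = 300     # experiences per gradient-descent mini-batch

def collect_experience():
    # Hypothetical stand-in for one agent step: (observation, action, reward)
    return (random.random(), random.random(), random.random())

def update_policy(minibatch):
    # Hypothetical stand-in for one gradient-descent step on a mini-batch
    pass

buffer = []
for step in range(30000):                  # agent steps
    buffer.append(collect_experience())
    if len(buffer) >= buffer_size:         # every buffer_size steps...
        for start in range(0, len(buffer), batch_size):
            update_policy(buffer[start:start + batch_size])
        buffer.clear()                     # PPO is on-policy, so old data is discarded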

SAC: Defines the size of the experience replay buffer, not the update frequency, which is specified by a separate parameter named steps_per_update.

If my understanding is correct (please let me know if I got something wrong), I’d rephrase the documentation text to something like this:

Updated Documentation

(default = 10240 for PPO and 50000 for SAC) - different behavior for PPO and SAC!
PPO: Number of experiences to collect before updating the policy model. Corresponds to how many experiences should be collected before we do any learning or updating of the model. This should be multiple times larger than batch_size. Typically a larger buffer_size corresponds to more stable training updates.
SAC: Maximum size of the experience buffer - on the order of thousands of times greater than the episode length, so that SAC can learn from old as well as new experiences.
Typical range: PPO: 2048 - 409600; SAC: 50000 - 1000000

If my understanding is correct, please let me know whether you prefer the updated text, so that I can make a pull request on GitHub.

Hi,
Your understanding is correct and I think this change would clarify our documentation. Do you want to make a PR or do you want me to take care of it?

Okay 🙂 I can make a PR - I'll post the link here within the next hour.

https://github.com/Unity-Technologies/ml-agents/pull/4252

Hi, I came across this thread while trying to find an even more detailed explanation of what exactly buffer_size is. But now I think I am unclear about experiences themselves. This is especially with regard to PPO.

From my understanding, an experience is a step with a query to the policy and not just a regular agent step. What I mean is that if the “Max Step” parameter on the Agent script is set to, say, 2500, this refers to the number of actions the agent performs per episode, analogous to how many times FixedUpdate is called. However, with the introduction of the “Decision Period” parameter in the Decision Requester component, things get a little more confusing. Suppose Decision Period is set to 5: this means that after every 5 actions the policy is actually queried for an action, and a reward is then recorded. This makes it so that 2500/5 = 500 steps are the “true” steps that are actually “experiences”. Am I right in thinking this?

So assuming I am right, we have 500 experiences per episode. Out of these 500 experiences per episode, only the most recent “time_horizon” of them contribute to the outcome of that episode. Am I right about the time_horizon part?

Now, after “buffer_size” experiences, the policy is actually updated. So if my buffer_size is 30000, then after 30000 experiences, or 30000/500 = 60 episodes, the policy is updated. I am assuming this buffer is what's called the experience buffer.

Next, “batch_size” is another challenging thing to understand. We split our “buffer_size” experiences into n batches of “batch_size” experiences each, in order to “update” the policy n times. So if the batch_size is 300 and the buffer_size is 30000, we do 30000/300 = 100 “iterations” of gradient descent. Now I don't understand where “num_epoch” comes into play here or what its purpose is.
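If I had to guess, num_epoch would be the number of passes made over the whole experience buffer during an update, so the total number of gradient-descent iterations per update would be num_epoch * (buffer_size / batch_size) - but that is just my assumption, please correct me. Here is the arithmetic I have in mind as a tiny script (the variable names are mine, not anything taken from the ML-Agents source):

max_step = 2500
decision_period = 5
experiences_per_episode = max_step // decision_period         # 500

buffer_size = 30000
batch_size = 300
num_epoch = 3   # my assumption: passes over the full buffer per update

episodes_per_update = buffer_size // experiences_per_episode  # 60
minibatches_per_epoch = buffer_size // batch_size             # 100
gradient_steps_per_update = num_epoch * minibatches_per_epoch # 300, if my guess is right

print(experiences_per_episode, episodes_per_update,
      minibatches_per_epoch, gradient_steps_per_update)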

The next question I have is: how do I calculate the memory footprint of my experience buffer? I keep getting errors like “sequence_length invalid”, “broken pipe”, or “UnityEnvironment worker 0: environment raised an unexpected exception.” when I try to increase my buffer_size to >= 8192. I know increasing the buffer size can lead to more “RAM? VRAM?” consumption, but I believe this is a relatively small buffer size and I should not be getting these errors. I will post the error logs below, but before that I want to clarify the memory calculation.

Mem(Experience Buffer) = (Mem(Observations) + Mem(Actions) + Mem(Rewards)) * buffer_size

Is this correct?

In my scenario, I just want a car to put a ball in the goal.

Car has 2 continuous input actions:

  1. Throttle - Forward / Backward Acceleration - Float
  2. Steering Direction - Float
    Mem(Actions) = 4 + 4 = 8 Bytes

Observations:

  1. 32 x 32 Grayscale FPS Visual Observation
  2. Single Raycast Distance - Float
  3. Current Steering Direction - Float
  4. Current Throttle - Float
    Mem(Observation) = (32 x 32) + 4 + 4 + 4 = 1024 + 4 + 4 + 4 = 1036 Bytes

Rewards:

  1. Discrete reward when Car makes contact with Ball
  2. Discrete reward when Ball makes contact with Goal
  3. Inverse Distance Squared from Car to Ball (cutoff after contact with ball)
  4. Inverse Distance Squared from Ball to Goal (starts after contact with ball)
    Mem(Rewards) = 4 (since all are added together to one float)

Taking these into the equation:
Mem(Experience Buffer) = (8 + 1036 + 4) * 8192 = 8,585,216 bytes ≈ 8.6 MB
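To make the estimate concrete (and so someone can point out where my assumptions are wrong), here is the same arithmetic as a tiny script. One caveat I am unsure about: if the buffer stores every observation value as a 32-bit float rather than one byte per grayscale pixel - which is what I would assume for a tensor-based trainer - the visual observation would cost 32 * 32 * 4 bytes and the total would be roughly four times larger, but still tiny:

buffer_size = 8192
bytes_per_value = 4          # assumption: everything is stored as float32

action_values = 2            # throttle, steering
vector_obs_values = 3        # raycast distance, current steering, current throttle
visual_obs_values = 32 * 32  # 32 x 32 grayscale pixels
reward_values = 1

values_per_experience = (action_values + vector_obs_values
                         + visual_obs_values + reward_values)
bytes_per_experience = values_per_experience * bytes_per_value

total_bytes = bytes_per_experience * buffer_size
print(f"{bytes_per_experience} bytes/experience, {total_bytes / 1e6:.1f} MB total")
# -> 4120 bytes/experience, ~33.8 MB total under the float32 assumption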

If this is true, then I should have no problem with my 16 GB of RAM and an Nvidia 3070 Ti with 8 GB of dedicated VRAM. I mention both because I still don't understand how to properly utilize the GPU during ML-Agents training, given how thin the documentation is on this subject. The only thing I am doing to utilize my GPU right now is adding --torch-device=cuda to my mlagents-learn command. I have, of course, installed a PyTorch build with CUDA support and made sure to get the corresponding CUDA toolkit version. I have no idea where this experience buffer is being stored; I checked Task Manager and that was pretty unhelpful too.
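For reference, the only sanity check I know how to run is the standard PyTorch one below, to confirm that the CUDA build is actually visible before passing --torch-device=cuda (these are plain PyTorch calls, nothing ML-Agents specific):

import torch

print(torch.__version__)                  # should show a CUDA build, e.g. 1.13.1+cu117
print(torch.cuda.is_available())          # True if the CUDA runtime and driver are usable
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g. the 3070 Ti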

I would really appreciate it if someone could clarify these for me.

Error logs from my latest run with batch_size 1024 and buffer_size 10240:
(mlagents) C:\Users\Anurag\ml-agents-latest_release>mlagents-learn config/Car2Ball_visual_curiosity_config_v3.yaml --run-id=test3_1024_10240 --torch-device=cuda --resume


Version information:
ml-agents: 1.0.0,
ml-agents-envs: 1.0.0,
Communicator API: 1.5.0,
PyTorch: 1.13.1+cu117
[WARNING] Training status file not found. Not all functions will resume properly.
[INFO] Listening on port 5004. Start training by pressing the Play button in the Unity Editor.
[INFO] Connected to Unity environment with package version 3.0.0-exp.1 and communication version 1.5.0
[INFO] Connected new brain: Car2Ball?team=0
[INFO] Hyperparameters for behavior name Car2Ball:
trainer_type: ppo
hyperparameters:
  batch_size: 1024
  buffer_size: 10240
  learning_rate: 0.0003
  beta: 0.005
  epsilon: 0.2
  lambd: 0.95
  num_epoch: 3
  shared_critic: True
  learning_rate_schedule: linear
  beta_schedule: constant
  epsilon_schedule: linear
checkpoint_interval: 500000
network_settings:
  normalize: False
  hidden_units: 128
  num_layers: 2
  vis_encode_type: simple
  memory: None
  goal_conditioning_type: hyper
  deterministic: False
reward_signals:
  extrinsic:
    gamma: 0.99
    strength: 1.0
    network_settings:
      normalize: False
      hidden_units: 128
      num_layers: 2
      vis_encode_type: simple
      memory: None
      goal_conditioning_type: hyper
      deterministic: False
  curiosity:
    gamma: 0.99
    strength: 0.02
    network_settings:
      normalize: False
      hidden_units: 128
      num_layers: 2
      vis_encode_type: simple
      memory: None
      goal_conditioning_type: hyper
      deterministic: False
    learning_rate: 0.003
    encoding_size: None
init_path: None
keep_checkpoints: 5
even_checkpoints: False
max_steps: 30000000
time_horizon: 128
summary_freq: 50000
threaded: True
self_play: None
behavioral_cloning: None
[INFO] Resuming from results\test3_1024_10240\Car2Ball.
[INFO] Resuming training from step 499978.
[INFO] Car2Ball. Step: 500000. Time Elapsed: 6.339 s. No episode was completed since last summary. Training.
[INFO] Exported results\test3_1024_10240\Car2Ball\Car2Ball-499978.onnx
[INFO] Car2Ball. Step: 550000. Time Elapsed: 46.842 s. Mean Reward: 166.833. Std of Reward: 85.181. Training.
[INFO] Car2Ball. Step: 600000. Time Elapsed: 89.331 s. Mean Reward: 154.484. Std of Reward: 67.336. Training.
[INFO] Car2Ball. Step: 650000. Time Elapsed: 131.203 s. Mean Reward: 140.996. Std of Reward: 85.288. Training.
[INFO] Car2Ball. Step: 700000. Time Elapsed: 173.653 s. Mean Reward: 152.901. Std of Reward: 66.126. Training.
[INFO] Car2Ball. Step: 750000. Time Elapsed: 217.055 s. Mean Reward: 146.363. Std of Reward: 74.872. Training.
[INFO] Car2Ball. Step: 800000. Time Elapsed: 256.871 s. Mean Reward: 148.012. Std of Reward: 72.254. Training.
[INFO] Car2Ball. Step: 850000. Time Elapsed: 298.509 s. Mean Reward: 152.311. Std of Reward: 93.526. Training.
[INFO] Car2Ball. Step: 900000. Time Elapsed: 340.512 s. Mean Reward: 147.693. Std of Reward: 92.437. Training.
[INFO] Car2Ball. Step: 950000. Time Elapsed: 382.421 s. Mean Reward: 152.774. Std of Reward: 66.762. Training.
Exception in thread Thread-2 (trainer_update_func):
Traceback (most recent call last):
[ERROR] UnityEnvironment worker 0: environment raised an unexpected exception.
Traceback (most recent call last):
  File "C:\Users\Anurag\miniconda3\envs\mlagents\lib\multiprocessing\connection.py", line 312, in _recv_bytes
    nread, err = ov.GetOverlappedResult(True)
BrokenPipeError: [WinError 109] The pipe has been ended

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\Anurag\miniconda3\envs\mlagents\lib\site-packages\mlagents\trainers\subprocess_env_manager.py", line 175, in worker
    req: EnvironmentRequest = parent_conn.recv()
  File "C:\Users\Anurag\miniconda3\envs\mlagents\lib\multiprocessing\connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "C:\Users\Anurag\miniconda3\envs\mlagents\lib\multiprocessing\connection.py", line 321, in _recv_bytes
    raise EOFError
EOFError
Process Process-1:
Traceback (most recent call last):
  File "C:\Users\Anurag\miniconda3\envs\mlagents\lib\multiprocessing\connection.py", line 312, in _recv_bytes
    nread, err = ov.GetOverlappedResult(True)
BrokenPipeError: [WinError 109] The pipe has been ended

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\Anurag\miniconda3\envs\mlagents\lib\site-packages\mlagents\trainers\subprocess_env_manager.py", line 175, in worker
    req: EnvironmentRequest = parent_conn.recv()
  File "C:\Users\Anurag\miniconda3\envs\mlagents\lib\multiprocessing\connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "C:\Users\Anurag\miniconda3\envs\mlagents\lib\multiprocessing\connection.py", line 321, in _recv_bytes
    raise EOFError
EOFError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\Anurag\miniconda3\envs\mlagents\lib\multiprocessing\process.py", line 314, in _bootstrap
    self.run()
  File "C:\Users\Anurag\miniconda3\envs\mlagents\lib\multiprocessing\process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\Anurag\miniconda3\envs\mlagents\lib\site-packages\mlagents\trainers\subprocess_env_manager.py", line 235, in worker
    _send_response(EnvironmentCommand.ENV_EXITED, ex)
  File "C:\Users\Anurag\miniconda3\envs\mlagents\lib\site-packages\mlagents\trainers\subprocess_env_manager.py", line 150, in _send_response
    parent_conn.send(EnvironmentResponse(cmd_name, worker_id, payload))
  File "C:\Users\Anurag\miniconda3\envs\mlagents\lib\multiprocessing\connection.py", line 206, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "C:\Users\Anurag\miniconda3\envs\mlagents\lib\multiprocessing\connection.py", line 280, in _send_bytes
    ov, err = _winapi.WriteFile(self._handle, buf, overlapped=True)
BrokenPipeError: [WinError 232] The pipe is being closed

Thanks
Anurag