Well, actually, what I want to know is this: how should I change my own Python algorithm when there are multiple training prefabs in Unity?
I saw in Unity's docs that I can duplicate the training object prefab, which is called parallel training, but the docs only cover the case where you use a built-in algorithm in ML-Agents.
Could you give me any tutorials on this, or some advice?
In the GridWorld environment this example uses, there are 9 “prefabs” corresponding to 9 Agents. This is reflected in the number of elements in the DecisionSteps and TerminalSteps being variable. The Trainer class in the tutorial deals with multiple Agents. To set up your Unity scene with multiple prefab Agents, you do not need to do anything different from what the provided algorithms require: having multiple Agents with the same Behavior Name will cause all of them to send and receive data to and from Python.
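A minimal sketch of what that looks like on the Python side. The DecisionSteps class here is a mocked stand-in for the mlagents_envs structure (no running Unity instance is assumed), and the observation shape is a hypothetical visual observation, not GridWorld's exact one:

```python
import numpy as np

# Stand-in for mlagents_envs' DecisionSteps: one batched structure
# holding every agent that requested a decision this step.
class DecisionSteps:
    def __init__(self, agent_id, obs):
        self.agent_id = agent_id  # 1-D array of AgentIds; order is not guaranteed
        self.obs = obs            # list of arrays, each shaped (n_agents, *obs_shape)

# 9 duplicated prefabs sharing one Behavior Name arrive as a single batch of 9.
steps = DecisionSteps(
    agent_id=np.arange(9),
    obs=[np.zeros((9, 64, 84, 3), dtype=np.float32)],  # hypothetical visual obs
)

# One forward pass / one set_actions call then covers all of them at once:
n_agents = len(steps.agent_id)
actions = np.zeros((n_agents, 1), dtype=np.int32)  # shape (n_agents, action_size)
print(n_agents)  # 9
```

The point is that duplicating the prefab changes only the batch dimension of what Python receives, not the structure of the code.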
The three Colab notebooks I sent are the only ones available at this moment to learn how to use the Low Level Python API. Do you have specific questions on how this API works?
Well, first of all, I really want to say thank you — these are exactly the kind of tutorials I needed.
I don't know why I couldn't find them; I actually tried pretty hard to find tutorials like these.
And your Colab code is really helpful.
Here are some questions.
How should I change the code if the Unity environment has a decision period of 5, like in your ML-Agents Q-Learning with GridWorld?
Related to the above: should I increase the decision period if my Unity environment needs to calculate a lot of physics (e.g. collisions)?
If the decision period is changed for all Agents, nothing should change: Unity would simply send messages less frequently, and Python will receive the observations in a batch. If the period changes differently for different Agents, then you will receive observations whenever Agents request decisions (make sure to look at what the AgentIds are in the DecisionSteps!). The code in the Colab should be able to handle a different decision period, if my memory is correct.
Picking the right decision period can be challenging. If the period is too small, each decision the algorithm makes will have “less” importance and it might be a lot harder for the Agent to assign credit to good and bad decisions. On the other hand, if the period is too large, the Agent might not be able to control the game effectively since the actions would be very sticky. It depends on the game, I usually set this value to be around 5 when physics is important. Unity does a fixed update 50 times per second (every 0.02 seconds) unless specified otherwise. So a period of 5 corresponds to 10 decisions per second (every 0.1 seconds).
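The arithmetic above can be checked in a couple of lines. Both input values are assumptions you would read from your own project settings (50 Hz is Unity's default fixed update rate; 5 is the decision period discussed here):

```python
fixed_update_hz = 50   # Unity's default: FixedUpdate runs every 0.02 s
decision_period = 5    # Decision Period on the DecisionRequester component

decisions_per_second = fixed_update_hz / decision_period
seconds_between_decisions = decision_period / fixed_update_hz

print(decisions_per_second)       # 10.0
print(seconds_between_decisions)  # 0.1
```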
This error occurs because the number of decisions requested and the number of decisions sent did not match: Unity was expecting 0 decisions but 9 were sent. The reason 0 decisions were requested is most likely that some agent(s) terminated in between decisions: you had 0 DecisionSteps but some TerminalSteps. You need to make sure that the shape of the action you send is always (number_of_agents, decision_size), and number_of_agents can be 0 if no decisions were requested last step.
If no agents requested a decision, then they do not need an action.
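A sketch of that shape rule. The helper name and action_size here are made up for illustration; the rule itself (the batch dimension must equal the number of agents in DecisionSteps, possibly zero) is what matters:

```python
import numpy as np

action_size = 1  # hypothetical: one discrete action branch

def build_actions(decision_agent_ids):
    # The action batch must always be (number_of_agents, action_size),
    # where number_of_agents is the size of DecisionSteps -- possibly 0.
    n = len(decision_agent_ids)
    return np.zeros((n, action_size), dtype=np.int32)

# Normal step: 9 agents asked for a decision -> send a (9, 1) array.
print(build_actions(np.arange(9)).shape)                  # (9, 1)
# Step where agents only terminated: 0 decisions requested -> send (0, 1).
print(build_actions(np.array([], dtype=np.int64)).shape)  # (0, 1)
```

Sending a fixed-size (9, 1) array on every step is exactly what triggers the mismatch error when a step happens to contain only terminal steps.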
I think storing the whole batch of actions (for 9 agents) in myaction is a terrible idea because the order of the AgentIds is not guaranteed. I think the best way to store trajectories is to keep an individual trajectory for each Agent (like what is done here: Google Colab). This way your code will be a lot more robust to terminal steps and to Agents skipping decisions.
If you really want to store trajectories as batched trajectories (which I do not recommend), you will have to store actions only on steps where all 9 agents requested a decision, while somehow managing the observations and rewards received in the terminal steps.
In ML-Agents, Agents can request decisions and terminate anytime. This is to allow more flexibility to the C# developer. This means that the data received by Python is unordered and does not have guarantees to always have the same number of Agents (and can even have 0 agents requesting decisions during a step to allow C# to signal an Agent terminated). This means that Python must keep track of individual Agents when storing trajectories.
def generate_trajectories(self, env, buffer_size):
    # Create an empty Buffer
    buffer: Buffer = []
    env.reset()
    behavior_name = list(env.behavior_specs)[0]
    spec = env.behavior_specs[behavior_name]
    # Create a Mapping from AgentId to Trajectories.
    # This will help us create trajectories for each Agent.
    dict_trajectories_from_agent: Dict[int, Trajectory] = {}
    dict_last_obs_from_agent: Dict[int, np.ndarray] = {}
    dict_last_action_from_agent: Dict[int, np.ndarray] = {}
    # Only for reporting
    dict_cumulative_reward_from_agent: Dict[int, float] = {}
    cumulative_rewards: List[float] = []
    # While there is not enough data in the buffer
    while len(buffer) < buffer_size:
        # Get the Decision Steps and Terminal Steps of the Agents
        decision_steps, terminal_steps = env.get_steps(behavior_name)
        # For all Agents with a Terminal Step:
        for agent_id_terminated in terminal_steps:
            # Create its last experience (is last because the Agent terminated)
            last_experience = Experience(
                obs=dict_last_obs_from_agent[agent_id_terminated].copy(),
                reward=terminal_steps[agent_id_terminated].reward,
                done=not terminal_steps[agent_id_terminated].interrupted,
                action=dict_last_action_from_agent[agent_id_terminated].copy(),
                next_obs=terminal_steps[agent_id_terminated].obs[0],
            )
At the beginning of generate_trajectories, on the very first loop iteration,
decision_steps has [0, 1, 2, 3, 4, 5, 6, 7, 8] and terminal_steps has [3], while dict_last_obs_from_agent is still the empty dictionary it was initialized as.
So a KeyError occurs at obs=dict_last_obs_from_agent[agent_id_terminated].copy() because dict_last_obs_from_agent is empty.
It seems there wouldn't be any problem if the very first loop had only decision_steps and empty terminal_steps, but I got the KeyError because there are also terminal_steps on the very first loop.
Does this happen with GridWorld? I never observed that before, and I was able to train GridWorld correctly. I think your guess is right: something in the environment terminates an Agent right after the first decision (which should not happen in GridWorld). I would recommend ignoring it (if agent_id_terminated is not in dict_last_obs_from_agent, do nothing).
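A minimal sketch of that guard, using plain dictionaries and lists to reproduce the reported first-step situation (no Unity environment is assumed):

```python
# Mocked state matching the report: terminal_steps contains agent 3,
# but no observation has been stored for any agent yet.
dict_last_obs_from_agent = {}
terminal_agent_ids = [3]

handled = []
for agent_id_terminated in terminal_agent_ids:
    # Guard: ignore terminations for agents we never stored a decision for
    # (e.g. an agent terminating right after reset), instead of raising
    # a KeyError on the empty dictionary.
    if agent_id_terminated not in dict_last_obs_from_agent:
        continue
    handled.append(agent_id_terminated)

print(handled)  # [] -- the spurious first-step termination is skipped
```

In the Colab code, the same check would go at the top of the terminal-steps loop, just before building last_experience.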