I am currently working on a project where an agent has to perform 5 discrete actions and 2 continuous actions.

Thankfully, in the latest implementation of Unity ML-Agents, hybrid control seems to be possible, since we can use discrete and continuous actions simultaneously.

I am curious to know how this was implemented in Unity ML-Agents. I have found a couple of papers online about hybrid control, such as http://proceedings.mlr.press/v100/neunert20a/neunert20a.pdf from DeepMind, but I haven't figured out which method is being applied in Unity ML-Agents.

Does anyone know which method for hybrid control is used in Unity ML-Agents? If so, are there any papers I could read to understand the method better? I am using ml-agents version 0.26.0.

Which algorithm are you using?
In the case of PPO, the action probabilities for continuous and discrete actions can be multiplied together to give the joint probability (we assume the discrete and continuous actions are independent).
In the case of SAC, we use multiple value heads, one for each discrete action, and use the regular continuous SAC action as input to the value and Q networks. In this case the continuous actions are "selected first" and the discrete actions are conditioned on the continuous actions.
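The critic shape described above can be sketched in a few lines of plain Python. This is a toy illustration, not the actual ML-Agents trainer code: `q_fn` is a hypothetical stand-in for a learned Q-network, and the point is only the interface, i.e. the continuous action is an *input* while there is one Q output per discrete action.

```python
def hybrid_sac_q_values(state, cont_action, q_fn, num_disc_actions):
    # One Q output per discrete action; the continuous action is an input,
    # so the discrete choice is conditioned on the continuous one.
    return [q_fn(state, cont_action, d) for d in range(num_disc_actions)]

# Toy stand-in critic (hypothetical): prefers discrete action 1 when the
# continuous action is positive, discrete action 0 otherwise.
def toy_q(state, cont_action, disc_action):
    return cont_action if disc_action == 1 else -cont_action

q_values = hybrid_sac_q_values(state=None, cont_action=0.7, q_fn=toy_q, num_disc_actions=2)
best_discrete = max(range(2), key=lambda d: q_values[d])
```

Selecting the discrete action as an argmax (or softmax) over these per-head Q-values is what makes the discrete branch depend on the already-chosen continuous action.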
I will ask around for references

I do not understand what you mean by "specify which discrete and continuous actions are multiplied against each other". All action probabilities are multiplied together. To update PPO, you need \pi(a | s) = \pi(a_continuous | s) * \pi(a_discrete | s)
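In log space, the factorization above is just a sum of per-branch log-probabilities. A minimal sketch in plain Python (not the actual ML-Agents code), assuming a diagonal Gaussian for the continuous branch and a categorical distribution for the discrete branch:

```python
import math

def gaussian_log_prob(x, mean, std):
    # Log-density of a 1-D Gaussian evaluated at x.
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std) - 0.5 * math.log(2 * math.pi)

def joint_log_prob(cont_actions, cont_means, cont_stds, disc_action, disc_probs):
    # log pi(a|s) = sum of continuous log-densities + log of the discrete pmf,
    # valid under the independence assumption described above.
    log_p = sum(gaussian_log_prob(a, m, s)
                for a, m, s in zip(cont_actions, cont_means, cont_stds))
    log_p += math.log(disc_probs[disc_action])
    return log_p

lp = joint_log_prob([0.0], [0.0], [1.0], 1, [0.25, 0.75])
```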

Thank you for the paper! I understood that Unity right now multiplies all policies of actions (discrete and continuous) happening in the same time-step. My real issue is that I have a continuous action dependent on a discrete action (sorry for not clarifying this before). I saw that I have the option of conditioning a discrete action on a state (masking), but not conditioning a continuous action on a discrete action or state… Would that be possible? Thank you for all the help!

Currently not possible; it would require some trainer code changes and I am not sure it would work as we would expect. You could condition on the previous discrete action by feeding the last action as an observation, but it will not be the same as taking an action "simultaneously". When you say "conditioning a continuous action on a state", that is already the case: the action already depends on the state/observations provided to the agent, so I am not sure what you mean here. What is the use case for having a continuous action depend on a discrete action (just curious)? It might be enough to have both discrete and continuous actions conditioned on the state for your use case.

I am developing a research project that simulates the operation of hydraulic bulb turbines under an ocean signal, where the goal is to maximize the total energy attained by the system. The turbines can be set to several "modes" of operation (Power Generation, Offline, Pump Mode, Idling Mode). While setting the mode of operation corresponds to a discrete action, the continuous action would be inputting power Pin (with possible values within [0, MaxPin]) when in "Pump Mode" (in any other mode of operation it would make sense that Pin = 0). That's why I thought of conditioning a continuous action on a discrete action or state. I also tried performing hybrid control by penalizing the agent when Pin != 0 in modes other than Pump Mode, or by just setting Pin = 0 independent of the agent's "Pin output" when in these modes, but in this scenario the agent would either choose Pin = 0 or Pin = MaxPin.

I have obtained some success by changing the Pin output to a discrete action (i.e. discretizing the range [0, MaxPin]). With this I can use action masking, and the agent is utilizing the pump… But still, I think a continuous input for the pump would be ideal.

For continuous actions, it will be impossible for the agent to select the action 0 exactly. This is because continuous control samples an action from a mean and a standard deviation, and the entropy regularization prevents the standard deviation from being 0. I believe it is fine to ignore the continuous Pin action of the agent when a mode other than Pump is selected. Even if you were to condition the continuous action on the discrete action, the continuous action would never be exactly 0. I am not sure this project is best solved with RL; maybe a planner or a supervised learning approach would give better results?
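The point about exact zeros can be seen with a quick sampling experiment. This is an illustrative sketch (plain Python, not agent code): draw many samples from a Gaussian policy head and count how many hit 0 exactly versus how many merely land near it.

```python
import random

random.seed(0)
mean, std = 0.0, 0.1  # even a small, nonzero std gives every exact value zero probability mass
samples = [random.gauss(mean, std) for _ in range(100_000)]

exact_zeros = sum(1 for s in samples if s == 0.0)   # a continuous sample essentially never equals 0.0
near_zeros = sum(1 for s in samples if abs(s) < 1e-3)  # but many samples land close to 0
```

This is why overriding Pin to 0 outside of Pump Mode (rather than hoping the policy outputs an exact 0) is the practical choice.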

I am using Mathf.Clamp on actionBuffers.ContinuousActions[0] as the continuous action output for the agent. Using Debug.Log() during training I see (exact) output values of "0", "PinMax" and values in between. After training, the agent outputs either "0" or "PinMax" values, exactly.

From what I understood, actionBuffers.ContinuousActions[0] outputs values in [-1, 1], so maybe Mathf.Clamp is what allows exact values at the extremes?

If a continuous action is clamped, it is very hard for the agent to learn that exact threshold. You are effectively discretizing the continuous action with this Clamp (it is equivalent to a max and it introduces a discontinuity/threshold in the continuous action). That makes it rather hard to learn, because if the agent selects -0.1 or -0.9 there will be no difference, and the agent will not get a strong learning signal. As an alternative, I would try
powerInputPumping = (actionBuffers.ContinuousActions[0] + 1f) / 2f * PinMax
and set powerInputPumping to zero if the discrete action requires it.
I am not sure it will learn better, but I think it is worth trying out.
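The difference between the two mappings can be shown side by side. This is a plain-Python sketch of the idea (the actual agent code above is C#): the clamp collapses a whole region of network outputs onto a single value, while the affine rescaling keeps every output distinct.

```python
def clamped_pin(raw, pin_max):
    # Mathf.Clamp-style mapping: everything below 0 collapses to 0, so
    # raw = -0.1 and raw = -0.9 produce identical power inputs
    # (no gradient signal to distinguish them).
    return min(max(raw, 0.0), pin_max)

def rescaled_pin(raw, pin_max):
    # Affine rescaling of the [-1, 1] network output onto [0, PinMax]:
    # every distinct raw value maps to a distinct power input.
    return (raw + 1.0) / 2.0 * pin_max
```

With pin_max = 10, clamped_pin gives 0 for both -0.1 and -0.9, whereas rescaled_pin gives 4.5 and 0.5 respectively, so the policy still receives feedback about *how far* below the threshold it is.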

How can we multiply the probabilities of the discrete and continuous actions? The discrete action can be assigned an actual probability (i.e., prob. mass fcn.), but in the case of the continuous action we can only calculate its probability density function. Do you mean we could simply multiply the discrete PMF by the continuous PDF? Isn't this wrong from a theoretical viewpoint?

You are right that probabilities are not calculated the same, but PPO does a ratio of current policy over old policy. The maximization roughly becomes

(current_policy_continuous_probability * current_policy_discrete_probability) / (old_policy_continuous_probability * old_policy_discrete_probability) * advantage
= (current_policy_continuous_probability / old_policy_continuous_probability) * (current_policy_discrete_probability / old_policy_discrete_probability) * advantage
Because of this ratio, it becomes reasonable to do this multiplication. I hope this makes sense.
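Why the PDF-vs-PMF mismatch is harmless here can be made concrete: within each branch, a density is divided by a density and a mass by a mass, so the units cancel branch-by-branch. A small sketch (plain Python, computed in log space as trainers typically do; not the actual ML-Agents code):

```python
import math

def ppo_ratio(new_cont_logp, old_cont_logp, new_disc_logp, old_disc_logp):
    # The joint probability factorizes, so the joint ratio is the product of
    # the per-branch ratios; within each branch, density is divided by density
    # and mass by mass, so the PDF-vs-PMF mismatch cancels.
    return math.exp((new_cont_logp - old_cont_logp) + (new_disc_logp - old_disc_logp))

# Continuous density ratio 0.4/0.2 = 2, discrete mass ratio 0.5/0.25 = 2,
# so the joint ratio is 4.
ratio = ppo_ratio(math.log(0.4), math.log(0.2), math.log(0.5), math.log(0.25))
```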

I wanted to know if the final layer of the actor neural network in PPO follows the same logic as SAC (below), where we have both a softmax and the moments of the Gaussian distribution. I am also wondering if the implementation is as straightforward as the multi-output models for combined classification and regression (https://machinelearningmastery.com/neural-network-models-for-combined-classification-and-regression/).

I am asking this since I recently found another paper where, for Hybrid-PPO, two independent actor neural networks (one for the discrete and the other for the continuous actions) are used (https://www.ijcai.org/proceedings/2019/0316.pdf), sharing only the first few layers to encode the state information.

Since the probability ratios of continuous and discrete actions are multiplied against each other, that is probably done inside the loss function. And since the gradient is obtained by differentiating the loss function, both discrete and continuous actions are parametrised by the same weights, i.e. they are output by the same neural network.

I guess this could also be done with two neural networks (one for continuous and one for discrete actions), but then you would need partial derivatives with respect to each network's weights.
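The shared-encoder, multi-head picture discussed above can be sketched in a few lines. This is a toy illustration in plain Python with hypothetical single-layer heads (a real trainer would use full neural-network modules): one shared state encoding feeds a softmax head for the discrete branch and mean/log-std heads for the continuous branch, so all branches share the encoder weights.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def linear(x, weights):
    # One toy linear layer per head (hypothetical weights).
    return [sum(xi * wi for xi, wi in zip(x, row)) for row in weights]

def hybrid_actor_heads(encoding, w_disc, w_mu, w_log_std):
    # Shared state encoding feeding separate heads: a softmax over discrete
    # actions plus the Gaussian moments (mean, std) for the continuous ones.
    disc_probs = softmax(linear(encoding, w_disc))
    mu = linear(encoding, w_mu)
    std = [math.exp(v) for v in linear(encoding, w_log_std)]  # exp keeps std positive
    return disc_probs, mu, std

disc_probs, mu, std = hybrid_actor_heads(
    encoding=[1.0, 2.0],
    w_disc=[[0.1, 0.2], [0.3, -0.1]],   # 2 discrete actions
    w_mu=[[0.5, 0.0]],                  # 1 continuous action mean
    w_log_std=[[0.0, 0.0]],             # log-std head -> std = exp(0) = 1
)
```

Because all heads read the same `encoding`, differentiating a joint loss propagates gradients from both branches back into the shared weights, which matches the single-network reasoning above.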