Turn-based behaviour question

I’m trying to use ML-Agents for a simplified version of Splendor, and I have a problem with how to set up the correct behaviours for a turn.

For now there are only two actions the agent should take each turn:

  • Pick currency (either 2 or 3 tokens, according to some rules; they can come from different stacks)
  • Buy a card (from 12 different ones)

The only close example I could find among the provided ones is Match 3, and this is where I’m not sure how to proceed. As far as I can tell, the Match 3 example uses a single discrete action whose branch size equals the number of possible moves. The problem is that my actions are either picking currency OR picking a card. If I use two discrete actions, both return results every step.
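
To illustrate what I mean (class and helper names here are just placeholders, not my actual code): with two discrete branches, OnActionReceived receives a value for both branches on every decision, so the agent would always do both things at once.

```
using Unity.MLAgents;
using Unity.MLAgents.Actuators;

public class SplendorAgent : Agent
{
    public override void OnActionReceived(ActionBuffers actions)
    {
        // Both branch values are filled in on every decision step,
        // so the agent would pick currency AND buy a card each turn.
        int currencyChoice = actions.DiscreteActions[0]; // branch 0
        int cardChoice = actions.DiscreteActions[1];     // branch 1
        // gameLogic.PickCurrency(currencyChoice);       // hypothetical helpers
        // gameLogic.BuyCard(cardChoice);
    }
}
```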

Now if I change it to only one discrete action with a branch size of 2 (for either picking currency or picking a card), I run into the problem of getting an actual value for that action. Do I use

  • 3 continuous actions for everything

  • 4 (3 for the currency picking and one for the card)

  • 6 (2 for 2 currency, 3 for 3 currency and 1 for the card)
    (the action branch could be extended to branch size 3 to include 2 and 3 currency as separate options)

The currency is divided into multiple stacks and given directly as input to the agent. My main problem is how to get a list of enums/ints back from the action vector so I can build the stack I want to pick from. The observations are added as one-hot encodings for the enums plus the normalized amounts. As output I would ideally like a List (or ints, for that matter).

```
using System;
using Unity.MLAgents.Sensors;

public enum Currency { Black = 0, Red = 1, Blue = 2, Green = 3, White = 4 }

// Scales a value into the [0, 1] range.
private float NormalizeValue(float currentValue, float minValue, float maxValue) {
    return (currentValue - minValue) / (maxValue - minValue);
}

public override void CollectObservations(VectorSensor sensor) {
    var amount = Enum.GetValues(typeof(Currency)).Length;
    for (int index = 0; index < amount; index++) {
        // One-hot for the currency colour, followed by its normalized stack size.
        sensor.AddOneHotObservation(index, amount);
        sensor.AddObservation(NormalizeValue(gameLogic.boardCurrency[index], 0, 8));
    }
}
```

The code example is only for the currency; cards are handled in a similar way.

I solved it in the end by using only discrete actions and action masks. Since all discrete branches always return a value, I evaluate a move twice: the first evaluation picks which action is taken, and the second evaluation returns the wanted parameter. In practice it looks like this:

Branch 0 - Options 0 to 3

  • 0 Picked last turn
  • 1 Draw Card
  • 2 Pick Currency
  • 3 Return Currency

Branch 1 - Options 0 to 12

  • 0 Do nothing
  • 1 to 12 card index

Branch 2 - Options 0 to X

  • 0 Do nothing
  • 1 to X Currency combination encoded as int value

Branch 3 - Options 0 to X

  • 0 Do nothing
  • 1 to X Currency combination encoded as int value

On the first eval the action mask deactivates branches 1-3, leaving only option 0 available in each, and also deactivates option 0 in branch 0.
On the second eval the result from branch 0 is used in the action mask: all branches except the one picked in the first eval are deactivated, branch 0 is locked to option 0, and option 0 in the picked branch is deactivated as well.
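
A minimal sketch of how that two-phase mask could be written (assuming a recent ML-Agents version with IDiscreteActionMask.SetActionEnabled; the branchSizes array and pickedAction field are hypothetical bookkeeping, and the options of branch 0 are laid out so that option N selects branch N):

```
using Unity.MLAgents.Actuators;

// pickedAction: branch 0 result of the first eval (0 = first eval still pending).
// branchSizes: number of options per branch, e.g. { 4, 13, X + 1, X + 1 }.
public override void WriteDiscreteActionMask(IDiscreteActionMask actionMask)
{
    if (pickedAction == 0)
    {
        // First eval: branch 0 picks the move type; "picked last turn" is illegal.
        actionMask.SetActionEnabled(0, 0, false);
        // Branches 1-3 are locked to option 0 ("do nothing").
        for (int branch = 1; branch < branchSizes.Length; branch++)
            for (int option = 1; option < branchSizes[branch]; option++)
                actionMask.SetActionEnabled(branch, option, false);
    }
    else
    {
        // Second eval: branch 0 is locked to option 0 ("picked last turn").
        for (int option = 1; option < branchSizes[0]; option++)
            actionMask.SetActionEnabled(0, option, false);
        for (int branch = 1; branch < branchSizes.Length; branch++)
        {
            if (branch == pickedAction)
            {
                // The picked branch must return a real parameter.
                actionMask.SetActionEnabled(branch, 0, false);
                // Parameters that are illegal for the current board state
                // would be disabled here as well.
            }
            else
            {
                // All other branches are locked to option 0.
                for (int option = 1; option < branchSizes[branch]; option++)
                    actionMask.SetActionEnabled(branch, option, false);
            }
        }
    }
}
```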

Example results would look like this:

  • 1/0/0/0 - 0/6/0/0 → Pick the 6th card on the board
  • 2/0/0/0 - 0/0/51/0 → Pick the currency combination that is encoded as 51 (Red/Blue/Green)
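
The encoding of a currency combination as a single int isn't fixed by ML-Agents; one illustrative scheme (not necessarily the mapping that produces 51 above) is to precompute the legal combinations once and use the branch value as an index into that list:

```
using System;
using System.Collections.Generic;

public static class CurrencyCombos
{
    // Index 0 is reserved for "do nothing"; 1..X map to real combinations.
    private static readonly List<Currency[]> Combos = Build();

    private static List<Currency[]> Build()
    {
        var combos = new List<Currency[]> { Array.Empty<Currency>() };
        var colours = (Currency[])Enum.GetValues(typeof(Currency));
        // Two tokens of the same colour.
        foreach (var colour in colours)
            combos.Add(new[] { colour, colour });
        // Three tokens of distinct colours.
        for (int a = 0; a < colours.Length; a++)
            for (int b = a + 1; b < colours.Length; b++)
                for (int c = b + 1; c < colours.Length; c++)
                    combos.Add(new[] { colours[a], colours[b], colours[c] });
        return combos;
    }

    // Turns the branch value back into the stacks to pick from.
    public static Currency[] Decode(int encoded) => Combos[encoded];
}
```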

This might not be perfect, but the action mask can be used to limit the choices the agent has to make. The action mask also filters out any illegal move, which speeds up training. The rewards used are:

  • -0.05 per turn (but only applied when branch 0 is not 0, so it is not applied twice)

  • +1 if the game is won

  • +0.X depending on the amount of victory points on the bought card

  • -0.15 for each currency returned over the limit (not when buying a card, but when the currency count is > 10)
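
A sketch of how the turn penalty can be guarded so it is only applied on the first eval (the gameLogic hooks are placeholders):

```
using Unity.MLAgents;
using Unity.MLAgents.Actuators;

public override void OnActionReceived(ActionBuffers actions)
{
    int branch0 = actions.DiscreteActions[0];
    if (branch0 != 0)
    {
        // Per-turn penalty, applied on the first eval only so the
        // second eval of the same turn does not count it again.
        AddReward(-0.05f);
    }
    // ... resolve the move via gameLogic (omitted) ...
    // The remaining rewards are added where the game logic resolves them,
    // e.g. AddReward(1f) on a win, AddReward(-0.15f) per returned token.
}
```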