Hey everyone,
I am trying to create an agent that observes the state of a forest and then decides whether to chop a tree or not. The general setup is that each tree in the forest spawns new trees with a certain probability, so over time the forest fills up.
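To make the growth rule concrete, here is a rough Python sketch of one growth step as described above. It is not my actual Unity code: the function name `grow`, the spawn probability of 0.1 and the capacity of 20 trees are placeholder assumptions for illustration.

```python
import random


def grow(num_trees, spawn_prob=0.1, capacity=20, rng=random):
    """One growth step: every existing tree spawns one new tree with
    probability spawn_prob, and the forest is capped at capacity.

    spawn_prob and capacity are illustrative placeholders.
    """
    spawned = sum(1 for _ in range(num_trees) if rng.random() < spawn_prob)
    return min(capacity, num_trees + spawned)
```

With these rules an empty forest stays empty, which is why chopping the last tree is the one thing the agent should never do.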
As input, the agent gets the number of trees in the forest and how much wood it has collected so far.
It is rewarded if it chops a tree and there are still trees remaining in the forest, and it is punished if it chops the last tree. (I also tried punishing it slightly in every step in which it does not chop a tree, but this yielded the same result.)
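The reward scheme can be sketched roughly like this in Python. The concrete values (+1 for a safe chop, -1 for chopping the last tree, an optional small idle penalty) are assumptions for illustration, not the exact numbers from my project, and `chop_reward` is a made-up name.

```python
def chop_reward(num_trees, chop, reward_chop=1.0, penalty_last=-1.0, penalty_idle=0.0):
    """Reward for one step, given the tree count before the action.

    All reward magnitudes are illustrative placeholders.
    """
    if not chop:
        # Optional small punishment for not chopping (0.0 disables it).
        return penalty_idle
    if num_trees <= 0:
        # Nothing left to chop, so no reward either way.
        return 0.0
    if num_trees == 1:
        # Chopping the last tree wipes out the forest.
        return penalty_last
    # At least one tree remains after the chop.
    return reward_chop
```

For example, `chop_reward(5, True)` gives the positive reward, while `chop_reward(1, True)` gives the penalty.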
The behaviour I expect is that the agent chops wood whenever there is more than one tree, and does not chop when there is just one tree left.
However, so far it always ends up with a policy where it does not chop at all. (Quite the druid…)
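To compare the policy I expect with the one the agent ends up with, here is a small self-contained Python simulation. Everything numeric in it (starting with 5 trees, 50 steps, spawn probability 0.1, capacity 20, rewards of +1 and -1, growth happening after the action) is an assumption for illustration and not taken from my Unity project.

```python
import random


def never_chop(num_trees):
    """The policy the trained agent currently converges to."""
    return False


def chop_if_spare(num_trees):
    """The policy I expect: chop only while more than one tree is left."""
    return num_trees > 1


def rollout(policy, start_trees=5, steps=50, spawn_prob=0.1, capacity=20, seed=0):
    """Run one episode and return the summed reward (all values are placeholders)."""
    rng = random.Random(seed)
    trees, total = start_trees, 0.0
    for _ in range(steps):
        if trees == 0:
            break  # the forest is gone, nothing can grow back
        if policy(trees):
            # +1 if trees remain after the chop, -1 for chopping the last one.
            total += 1.0 if trees > 1 else -1.0
            trees -= 1
        # Growth step: each remaining tree spawns a new one with spawn_prob.
        spawned = sum(1 for _ in range(trees) if rng.random() < spawn_prob)
        trees = min(capacity, trees + spawned)
    return total


if __name__ == "__main__":
    print("never chop:   ", rollout(never_chop))
    print("chop if spare:", rollout(chop_if_spare))
```

Under these assumptions the never-chop policy collects a return of zero, while the threshold policy collects at least four rewards when starting from five trees, so the behaviour I expect should be the better one in terms of return.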
Any suggestions on how to proceed? Or is this a kind of problem that generally cannot be solved well by reinforcement learning?
Also here is my config file:
behaviors:
  Chopper:
    trainer_type: ppo
    hyperparameters:
      batch_size: 10
      buffer_size: 100
      learning_rate: 3.0e-4
      beta: 5.0e-4
      epsilon: 0.2
      lambd: 0.99
      num_epoch: 3
      learning_rate_schedule: linear
    network_settings:
      normalize: false
      hidden_units: 128
      num_layers: 3
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
      curiosity:
        strength: 0.02
        gamma: 0.99
        encoding_size: 256
        learning_rate: 3.0e-4
    max_steps: 100000
    time_horizon: 1000
    summary_freq: 1000
    threaded: true
Any help is greatly appreciated!
Cheers