Hi everyone, I have been enjoying trying out ML-Agents. This is what I am trying to achieve.
Basic goal: the agent should learn to bring gold from the mine to the base. He needs to be able to do that back and forth: take 1 gold from the mine, deliver it to the base, and go back to the mine for more.
Intermediate goal: the agent should take the shortest path between the mine and the base, since that's more efficient.
So far I have tried a few different approaches to achieve this, and I am not sure whether the problem is in the script/reward system or in some other settings.
Video of how the agent behaves:
https://www.dropbox.com/s/ccgw4gnf1gbmwkt/2021-01-22 00-13-54.mkv?dl=0
If the agent is given a Vector3 to observe, does he internally understand that this specific Vector3 belongs to a specific object? If he doesn't, how can he ever learn to follow that Vector3 when its position changes? And if he is observing the Vector3s of multiple objects, how can he ever tell those vectors apart?
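To make the question concrete, this is roughly how I give him those Vector3 observations (a simplified sketch, not my exact code; the field names are placeholders and I'm assuming just one mine and one base here):

// Simplified sketch of the observation code (placeholder names, one mine, one base).
using Unity.MLAgents;
using Unity.MLAgents.Sensors;
using UnityEngine;

public class FarmerAgent : Agent
{
    public Transform mineTransform;   // assigned in the Inspector
    public Transform baseTransform;
    bool carryingGold;

    public override void CollectObservations(VectorSensor sensor)
    {
        // Each AddObservation(Vector3) writes 3 floats into the observation vector,
        // always in the same order every step.
        sensor.AddObservation(mineTransform.position);        // 3 floats
        sensor.AddObservation(baseTransform.position);        // 3 floats
        sensor.AddObservation(carryingGold ? 1f : 0f);        // 1 float
    }
}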
I have tried a few approaches so far. One was to give him a reward only after he visits both the mine and the base. He would get a reward for every completed pair, and his life would also be extended so he can make more deliveries. If he doesn't manage to complete a pair in time, he dies because his life expires. He doesn't receive a negative reward for dying, but he does receive a small negative reward while he is alive. I also tried ending the episode after 1 delivery, but when testing, the agent wasn't able to do multiple deliveries.
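Roughly, the reward/life logic from that approach looked like this (again a simplified sketch with placeholder names and values, in the same agent class as above):

// Additional fields on the same agent class (placeholder values).
float lifeRemaining = 10f;
float lifeBonus = 5f;

void OnTriggerEnter(Collider other)
{
    if (other.CompareTag("Mine") && !carryingGold)
    {
        carryingGold = true;
    }
    else if (other.CompareTag("Base") && carryingGold)
    {
        carryingGold = false;
        AddReward(1.0f);              // reward only once a mine -> base pair is completed
        lifeRemaining += lifeBonus;   // extend his life so he can make more deliveries
    }
}

void FixedUpdate()
{
    AddReward(-0.0005f);              // small penalty while alive, nothing extra for dying
    lifeRemaining -= Time.fixedDeltaTime;
    if (lifeRemaining <= 0f)
    {
        EndEpisode();                 // life expired before completing a pair
    }
}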
The distances from the agent's starting position to the mine and to the base are always the same, to make the training reward structure more consistent.
I also tried giving the agent additional observations, like the distance between the mine and the agent and the distance between the base and the agent, and I also try to tell him which type of collision happened and which object he collided with, etc. I am not sure if these are needed, but without them it didn't work either.
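Those extra observations look roughly like this, added inside the same CollectObservations as above (just a sketch with placeholder names; I'm assuming 0/1 flags are a reasonable way to encode which object was hit):

// Extra observations, added inside CollectObservations (sketch, placeholder names).
sensor.AddObservation(Vector3.Distance(transform.position, mineTransform.position)); // 1 float
sensor.AddObservation(Vector3.Distance(transform.position, baseTransform.position)); // 1 float
sensor.AddObservation(lastCollisionWasMine ? 1f : 0f);  // which object was hit last,
sensor.AddObservation(lastCollisionWasBase ? 1f : 0f);  // encoded as 0/1 flags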
The code of the agent:
Settings:
https://www.dropbox.com/s/smm84x2q9003iw0/agent settings.png?dl=0
I think the problem might be in the settings. In initial tests there were many mines and bases present, so I increased the vector observation space size by 3 for every mine and base, since I had to track their Vector3 positions. Also, I just noticed I set the vector action space size to 4; from my understanding this should be 2, since the agent is controlled by 2 inputs, x and z. But in an earlier test I had it at 2, and the agent still had problems.
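For reference, the movement code only reads two continuous actions, which is why I think the action space size should be 2. A simplified sketch (using the float[] overload; newer ML-Agents versions use ActionBuffers instead):

// Simplified movement sketch: only two continuous actions are read (x and z).
public float moveSpeed = 5f;   // placeholder value

public override void OnActionReceived(float[] vectorAction)
{
    float moveX = Mathf.Clamp(vectorAction[0], -1f, 1f);
    float moveZ = Mathf.Clamp(vectorAction[1], -1f, 1f);
    transform.position += new Vector3(moveX, 0f, moveZ) * moveSpeed * Time.fixedDeltaTime;
}

The trainer config I am using is below: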
behaviors:
  Farmer:
    trainer_type: ppo
    hyperparameters:
      batch_size: 10
      buffer_size: 100
      learning_rate: 3.0e-4
      beta: 5.0e-4
      epsilon: 0.2
      lambd: 0.99
      num_epoch: 3
      learning_rate_schedule: linear
    network_settings:
      normalize: false
      hidden_units: 128
      num_layers: 2
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
    max_steps: 500000000000000000000000000000
    time_horizon: 64
    summary_freq: 10000
I hope someone can help, thanks in advance!