I’m trying to train my rabbit agents to collect food in a closed environment. There are four walls around the area that the agents should not touch. If they stay in contact with the wall, they are punished every step.
After some time, instead of trying to collect food (which is available everywhere), they all start to hug the wall despite being heavily punished for doing so. I just can’t understand this behavior; it seems like they are trying to pursue the most negative reward. As seen in the picture, the rabbits are touching the top-left corner of the wall.
Can you provide an overview of other positive/negative rewards your environment is assigning?
This often happens in cases where the policy cannot find any positive reward signal to optimize and tries to mitigate negative reward accumulation by ending the episode as fast as possible (suicide).
I’ve often seen border/wall seeking when the action distribution collapses and the agent chooses the same action continuously. You can tell if this is happening by looking at the entropy in TensorBoard: it will crash toward 0, and once the policy has collapsed it usually does not recover.
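To make the entropy check concrete, here is a minimal sketch of what that number measures, assuming a discrete action space with four actions. The function name and the example probabilities are made up for illustration and are not taken from your project or from the trainer's code.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Hypothetical action probabilities for a 4-action agent.
uniform = [0.25, 0.25, 0.25, 0.25]        # still exploring
collapsed = [0.997, 0.001, 0.001, 0.001]  # almost always the same action

print("uniform  :", entropy(uniform))
print("collapsed:", entropy(collapsed))
```

A policy that still explores sits near the maximum (log of the number of actions), while a collapsed one is close to zero. With continuous actions the reported entropy is computed differently, so only the trend carries over, not the exact values.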
The agents observe their hunger and thirst meters; the maximum value for each is 110. The meters decrease over time, and if either of them drops below 50, the agent is given a negative reward on every step it stays hungry or thirsty.
If the agent stays in collision with a food object (as seen in the pictures above) or with a lake (a water source), it is given a positive reward. However, when the hunger or thirst meter has reached its maximum value and the agent keeps touching the food or water source, it is given a negative reward to prevent overeating.
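For readers following along, the per-step reward scheme described above can be sketched roughly as follows. This is a Python paraphrase, not the actual agent code: only the 110 maximum and the 50 threshold come from the description, while the reward magnitudes, the `AgentState` class and the `step_reward` function are placeholders chosen for illustration.

```python
from dataclasses import dataclass

MAX_METER = 110.0      # maximum hunger/thirst value (from the description)
LOW_THRESHOLD = 50.0   # below this the agent counts as hungry/thirsty

# Hypothetical magnitudes; the real values are not given in the thread.
EAT_REWARD = 0.01
NEED_PENALTY = -0.005
OVEREAT_PENALTY = -0.01
WALL_PENALTY = -0.01

@dataclass
class AgentState:
    hunger: float
    thirst: float
    touching_food: bool = False
    touching_water: bool = False
    touching_wall: bool = False

def step_reward(s: AgentState) -> float:
    """Reward for a single step, following the rules described in the post."""
    r = 0.0
    # Penalty while either meter is below the threshold.
    if s.hunger < LOW_THRESHOLD or s.thirst < LOW_THRESHOLD:
        r += NEED_PENALTY
    # Touching food: rewarded unless the hunger meter is already full.
    if s.touching_food:
        r += OVEREAT_PENALTY if s.hunger >= MAX_METER else EAT_REWARD
    # Touching water: rewarded unless the thirst meter is already full.
    if s.touching_water:
        r += OVEREAT_PENALTY if s.thirst >= MAX_METER else EAT_REWARD
    # Staying in contact with a wall is punished every step.
    if s.touching_wall:
        r += WALL_PENALTY
    return r

print(step_reward(AgentState(hunger=80, thirst=80, touching_food=True)))
print(step_reward(AgentState(hunger=110, thirst=80, touching_food=True)))
```

Written out like this, it is easier to see that the same contact (touching food) flips from reward to penalty depending on a single observation, which the policy has to learn to condition on.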
The wall-hugging behavior starts to appear around step 40M. I’m now at step 78M and the situation has slightly improved (some of the agents have started to do something other than hugging the wall), but overall the agents are not acting as intended. They either hug the walls, hug the lakes, or chase the food endlessly despite being full (and are punished for doing that). I don’t quite get why they cannot learn to balance eating and drinking, and instead stick to only one of them at a time. It’s been more than 48 hours of training now, and I think this is taking too long for a task like this.
Thanks for the suggestion but I already normalized the observation space myself in the code.
So after adjusting the reward values I have got some better results.
It now takes the agents ~7 hours before they start picking up the task. The problem was that overeating yielded a larger negative reward than touching the wall. This made the agents treat touching food as bad (at least worse than touching the wall), so they tried to avoid all kinds of food and water. After making overeating and touching the wall yield the same negative reward, the performance is better, as seen above.
Edit: Another change I made was to reduce the number of food objects around the agents. It seems that when the agents happen to overeat, they can’t easily “get away” from the food, since there are so many objects surrounding them and, because the objects are very close, the agents’ raycasts are blocked. For this reason the agents tried to go to a clear spot to get away from the “dangerous” food.
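A tiny arithmetic sketch of that imbalance, using made-up numbers (the thread does not state the actual reward values, and `wall_preference` is just an illustrative helper):

```python
def wall_preference(overeat_penalty: float, wall_penalty: float) -> float:
    """Per-step gain from touching the wall instead of food while full.

    A positive value means the wall is the less bad of the two contacts.
    """
    return wall_penalty - overeat_penalty

# Hypothetical values: before the change overeating was punished harder.
before = wall_preference(overeat_penalty=-0.05, wall_penalty=-0.01)
# After the change both contacts carry the same penalty.
after = wall_preference(overeat_penalty=-0.01, wall_penalty=-0.01)

print("before:", before)
print("after :", after)
```

With the larger overeating penalty, a full agent surrounded by food comes out ahead every step it spends at the wall instead; with equal penalties that incentive disappears.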
You may see faster training by reducing the negative rewards even more. The intuition I use is that the good-behavior signals need to drown out the bad ones in early training, when the policy is still largely random. I find this also improves the explore/exploit balance a little in later training.
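As a rough illustration of that intuition, the sketch below estimates the average per-step reward of a near-random policy under two penalty scales. Every number in it (the share of steps spent in each state and the base reward values) is a hypothetical placeholder, not a measurement from this environment.

```python
# Hypothetical share of steps a near-random early policy spends in each state;
# the remaining steps are assumed to give zero reward.
OCCUPANCY = {"eating": 0.10, "overeating": 0.15, "wall": 0.20}

# Hypothetical per-step rewards before any rescaling.
BASE_REWARD = {"eating": 0.01, "overeating": -0.01, "wall": -0.01}

def mean_step_reward(penalty_scale: float) -> float:
    """Average per-step reward when only the negative rewards are scaled."""
    total = 0.0
    for state, fraction in OCCUPANCY.items():
        r = BASE_REWARD[state]
        if r < 0:
            r *= penalty_scale  # shrink (or grow) penalties only
        total += fraction * r
    return total

for scale in (1.0, 0.25):
    print(f"penalty scale {scale}: mean step reward {mean_step_reward(scale):+.6f}")
```

Under these assumed numbers the penalties dominate at full scale, so the average signal a random policy sees is negative. Shrinking them lets the occasional eating reward show through. The right scale for a real environment still has to be found by experiment.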