Do the hidden layers only see the final feature maps produced after all convolution and pooling steps, or do they also work with intermediate data?
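For context, this is my mental model of the encoder, as a minimal PyTorch-style sketch (a generic conv stack I made up for illustration, not the actual implementation): if it really is a plain sequential stack like this, the dense hidden layers would only ever receive the flattened output of the last conv/pool stage.

```python
import torch.nn as nn

# Simplified sketch of how I picture a visual encoder; layer sizes are
# my own assumption, not taken from any particular library.
encoder = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),   # 1x32x32 -> 16x32x32
    nn.ReLU(),
    nn.MaxPool2d(2),                              # 16x32x32 -> 16x16x16
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # 16x16x16 -> 32x16x16
    nn.ReLU(),
    nn.MaxPool2d(2),                              # 32x16x16 -> 32x8x8
    nn.Flatten(),                                 # 32*8*8 = 2048 features
    nn.Linear(2048, 128),  # hidden layers start here; intermediate maps are gone
)
```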
I have another use case where I'm training an agent to map a dungeon with different room sizes. The visual observation is a top-down orthographic 32x32 black-and-white view of the mapped area. It starts out all black; as the agent moves around the rooms, grid cells get filled with grey values representing their accessibility (the number of neighbouring walls). The agent receives a discrete reward for each new cell it detects. So far I have trained it for 30M steps, and slowly but surely it is getting more efficient overall. However, the agent occasionally still gets stuck in a looping pattern even though it should be able to see a nearby exit to still-unmapped cells.
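For reference, this is roughly how the observation and reward work; a minimal sketch in plain numpy, where the function names and the exact grey mapping are my own simplification of the setup described above:

```python
import numpy as np

GRID = 32

def grey_value(num_wall_neighbours: int) -> float:
    # My own mapping for illustration: 0 walls -> bright open floor,
    # 4 walls -> dark but still distinguishable from unexplored black.
    return 1.0 - 0.2 * num_wall_neighbours

# 0.0 = unexplored (black); the view starts out entirely black.
observation = np.zeros((GRID, GRID), dtype=np.float32)

def reveal(obs: np.ndarray, x: int, y: int, num_wall_neighbours: int) -> float:
    """Mark a cell the agent has just detected; return the discrete reward."""
    if obs[y, x] == 0.0:                 # cell was still unexplored
        obs[y, x] = grey_value(num_wall_neighbours)
        return 1.0                       # reward only for newly detected cells
    return 0.0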
My suspicion is that the agent is somewhat aware of the overall layout, but fails to detect critical details like a 2px-wide gap representing a doorway (see the toy example below). Are convolution and pooling doing more harm than good in this case?
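To illustrate the suspicion, here is a toy numpy example (the exact values are made up; only the geometry matters): a 2px-wide feature that does not line up with the 2x2 pooling grid can vanish after a single max-pool step.

```python
import numpy as np

def max_pool_2x2(img: np.ndarray) -> np.ndarray:
    # Plain 2x2 max pooling via reshape, for demonstration only.
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

wall = np.ones((4, 8), dtype=np.float32)  # bright wall segment
wall[:, 3:5] = 0.0                        # 2px-wide dark doorway at columns 3-4,
                                          # misaligned with the 2x2 pooling grid

print(max_pool_2x2(wall))  # every window overlaps a bright pixel -> all ones,
                           # the doorway has disappeared entirely
```

If the encoder pools twice on a 32x32 input (32 -> 16 -> 8), as in my sketch above, a 2px doorway would have to survive both steps, which makes me doubt it reliably reaches the hidden layers at all.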