I’m trying to make a simple Car Driver Agent that follows a track, but I’m having trouble getting it to work.
The main issue seems to be that the Agent never tries all possible actions; it always ends up repeating the same few actions over and over, so the reward never improves.
I have two Discrete Action branches, each with 3 possible values:
Accelerate, DontMove, BrakeReverse
TurnLeft, DontTurn, TurnRight
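In code, the mapping from those two branches to the car looks roughly like this (a simplified sketch rather than my exact code; the OnActionReceived signature depends on the ML-Agents version, and ApplyDriving is a placeholder for the car controller):

```csharp
using Unity.MLAgents;
using Unity.MLAgents.Actuators;

public class CarDriverAgent : Agent
{
    public override void OnActionReceived(ActionBuffers actions)
    {
        // Branch 0: 0 = Accelerate, 1 = DontMove, 2 = BrakeReverse
        int driveAction = actions.DiscreteActions[0];
        // Branch 1: 0 = TurnLeft, 1 = DontTurn, 2 = TurnRight
        int turnAction = actions.DiscreteActions[1];

        float forwardInput = driveAction == 0 ? 1f : (driveAction == 2 ? -1f : 0f);
        float turnInput = turnAction == 2 ? 1f : (turnAction == 0 ? -1f : 0f);

        ApplyDriving(forwardInput, turnInput); // hand the inputs to the car physics
    }

    void ApplyDriving(float forward, float turn) { /* car controller code */ }
}
```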
From what I understand, the way this kind of machine learning works is by first trying out actions at random and then seeing which of those random actions result in a good reward.
So I would expect the agent to try all combinations of those actions in order to find one that seems to work.
But what actually happens is the agent mostly just does a single action, and it’s a different one every time I run the game.
I’ve set the reward based on distance traveled along the track.
And the Observations that I’m using are:
Current Position
Next Checkpoint Position
Raycast Distance to Wall At Angle 0
Raycast Distance to Wall At Angle +45
Raycast Distance to Wall At Angle -45
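Collected more or less like this (again a simplified sketch; nextCheckpoint and RaycastDistance stand in for my own field/helper, and this sits in the same Agent class as above):

```csharp
// Needs: using Unity.MLAgents.Sensors;
public override void CollectObservations(VectorSensor sensor)
{
    sensor.AddObservation(transform.position);       // current position (3 values)
    sensor.AddObservation(nextCheckpoint.position);  // next checkpoint position (3 values)
    sensor.AddObservation(RaycastDistance(0f));      // wall distance straight ahead
    sensor.AddObservation(RaycastDistance(+45f));    // wall distance at +45 degrees
    sensor.AddObservation(RaycastDistance(-45f));    // wall distance at -45 degrees
}
```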
So my issue is that I don’t even know what is wrong. I’ve tried messing around with pretty much every single parameter, and I still cannot get the agent to try all the actions in order to find the ones that work.
It is strange that the agent isn’t exploring the action space more fully. It may be hard to diagnose without looking at the code, but I think there are a few setup details to clarify that may give us more insight into what’s causing the issue.
A few questions:
Does your environment have end conditions, i.e. does the episode restart if the agent runs off the track/into a wall or hits the max step count?
Can you describe your reward function more explicitly?
A few things stand out to me here:
(1) The learning rate and beta seem a little low. I’d recommend initially trying 0.0003 and 0.001, respectively (see the config snippet after this list). Increasing beta incentivizes the agent to behave more randomly, which may help your issue, but I suspect there’s something else at play here.
(2) The time horizon seems a little large for this problem. I’d try turning that down to 1000.
(3) The observations of current position and next checkpoint position strike me as a little odd if the agent’s only goal is to move as far forward as possible. If the objective is just to move forward and not hit the walls, the raycasts should definitely be enough.
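For reference, those values would sit in the trainer config roughly like this (the behavior name is a placeholder and the exact layout depends on your ML-Agents version):

```yaml
behaviors:
  CarDriver:                 # placeholder behavior name
    trainer_type: ppo
    hyperparameters:
      learning_rate: 3.0e-4  # suggested starting point
      beta: 1.0e-3           # higher beta = stronger entropy bonus = more random actions
    time_horizon: 1000
```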
Yes, the episode ends and restarts upon hitting a wall.
Right now my reward function gives +1 for each checkpoint and (-time * .1f) in OnActionReceived (to discourage the agent from standing still until the time ends).
I’ve also tried -10 for hitting a wall, but it didn’t seem to help.
I’ve also tried adding the Curiosity parameter, but it didn’t seem to do anything.
Is it better to have fewer or more observations? I also tried adding the current rotation angle, transform.forward, and the direction to the next checkpoint, but again didn’t notice any difference.
Maybe the issue is simply the volume of training steps? For this kind of problem, how long do I need to train before I see any kind of results?
I’ll try the values you mentioned and see if it helps, thanks!
Should I be using Stacked Vectors? Currently I have it set to 1.
What about Decision Period? I have it set to 5 with Take Actions Between Decisions enabled.
If I set a beta of 1.0, shouldn’t that mean it behaves completely randomly, which should guarantee that it tries all possible actions?
Even with that I still have the same issue; it doesn’t explore more than 2-3 possible actions.
I’m a bit suspicious of (-time * .1f). What is the value of time? It’s possible that if this reward is too negative, the agent is learning to end episodes as quickly as possible to minimize the penalty it incurs. Are your episode lengths really short?
Since the agent can end the episode by hitting the wall, it may even make sense to give a survival bonus. However, to simplify this problem, can you just try administering a reward for forward velocity per timestep? This way, the agent is encouraged to (1) move forward quickly and (2) stay alive so that it can keep collecting reward for moving forward quickly.
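Something like this inside OnActionReceived is what I have in mind (a sketch; rb stands for the car’s Rigidbody, and the 0.01f scale is arbitrary and worth tuning):

```csharp
// Small positive reward every step for moving in the car's forward direction.
float forwardSpeed = Vector3.Dot(rb.velocity, transform.forward);
AddReward(forwardSpeed * 0.01f);
```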
How sparse are your checkpoints? If they are very far away from each other, curiosity could help but I’m not sure. My feeling is the agent should learn to discover checkpoints.
As far as observations go, it’s not necessarily true that fewer is better (fewer is cheaper computationally, but that’s all). What’s important is that the observations capture enough information for the agent to decide what to do next. From what I understand, the agent needs to learn to move forward and not hit any walls. To do this, it needs to know where the walls are relative to itself and which direction is forward. The raycasts and a fixed vector pointing forward may be all you need. With this in mind, going back to the reward suggestion above, you can use velocity and maybe the dot product between the agent’s forward vector and that fixed forward vector.
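As a sketch of that reduced setup (trackForward here is a placeholder for whatever fixed “forward along the track” direction you have available, e.g. the direction to the next checkpoint):

```csharp
// Leaner observations: the three wall raycasts plus the fixed forward direction
// expressed in the car's local frame. Goes inside the Agent class.
public override void CollectObservations(VectorSensor sensor)
{
    sensor.AddObservation(RaycastDistance(0f));
    sensor.AddObservation(RaycastDistance(+45f));
    sensor.AddObservation(RaycastDistance(-45f));
    sensor.AddObservation(transform.InverseTransformDirection(trackForward));
}

// And in OnActionReceived, alongside the forward-speed reward above, a small
// bonus for facing along the track:
// AddReward(0.001f * Vector3.Dot(transform.forward, trackForward));
```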
I don’t think stacked vectors are necessary for this. Those are helpful when the agent needs a very short memory of past events. Decision period is the number of fixed updates that elapse between decisions. A value of 5 is reasonable, and I would caution against tuning it until you have a real need for the agent to make more decisions in a given amount of time.
As far as training volume goes, if the agent doesn’t start showing an intention to move forward/avoid walls after maybe 70k timesteps, something is probably off. I’m just basing this on my judgment of your problem, though. It’s possible that it’s more complicated than I’m understanding and would require more time.
Many thanks! Setting normalize to true did it! Now every time I run it, the agent does indeed try out all possible actions at roughly the rate I would expect.
The penalty I have is AddReward(-Time.fixedDeltaTime * .1f); whereas a checkpoint gives +1.
I’ll try without that penalty, and with normalize on, to see if the agent no longer gets stuck in a never-ending rotation.
The one thing I did previously that greatly helped was indeed placing the checkpoints much closer together. Initially I had them about 10 units apart; when I added more of them, about 1 unit apart, the agent started to learn.
Is it also possible that this is the kind of problem that would greatly benefit from imitation learning? I’m currently looking into how that works.
Setting normalize to true was indeed the key to solving my problem.
With that, and after spending a few hours training and setting up checkpoint positions, I now have a working AI Driver.
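For anyone finding this later, the change was just this one flag in the trainer config (behavior name is a placeholder; exactly where the flag lives depends on the ML-Agents version):

```yaml
behaviors:
  CarDriver:
    network_settings:
      normalize: true   # rescale vector observations with a running mean/std
```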
@andrewcoh_unity
It seems to me that in certain situations, without normalization, the network collapses (“converges”) really fast to some weird local minimum. It happened to me in continuous control and to CodeMonkey in discrete control.
Maybe it is a learning rate issue, but anyway, that is kind of problematic.
What is the best way to address this? Open a bug on GitHub?
@CodeMonkeyYT
Do you by any chance have the entropy graph of the experiment with normalize = false?
Hmm, looking at the entropy, it seems that with normalize set to false it was falling to 0,
whereas with it set to true it stays stable while the agent consistently gains more reward.
I was running the training with a different run name every time, so it’s kind of hard to analyze the graphs.
It’s hard to tell what’s going on from inspecting the graphs.
I do not think this is a bug. My guess is that the values of some of the vector observations were too large (or too different in scale) and were dominating the activations of the network. Normalization maps everything to roughly mean 0. It is strange, though, and I’ll think more about it.
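Conceptually, with normalize enabled each vector observation is rescaled with a running mean and standard deviation before it reaches the network, roughly like this (a conceptual sketch only, not the actual ML-Agents implementation; the clamp range is illustrative):

```csharp
using UnityEngine;

// Keeps running statistics for one observation and rescales incoming values so
// they have roughly mean 0 and unit variance, so no single large-valued
// observation dominates the network's activations.
public class RunningNormalizer
{
    int count;
    float mean;
    float m2; // sum of squared deviations (Welford's algorithm)

    public float Normalize(float value)
    {
        count++;
        float delta = value - mean;
        mean += delta / count;
        m2 += delta * (value - mean);
        float std = Mathf.Sqrt(Mathf.Max(m2 / count, 1e-8f));
        return Mathf.Clamp((value - mean) / std, -5f, 5f);
    }
}
```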