GAIL vs Behavioral Cloning: what's the difference?

I couldn’t really find a detailed explanation in the docs. Some of the imitation config files (ml-agents/config/imitation at main · Unity-Technologies/ml-agents · GitHub), like Crawler, include both; others, like PushBlock, include just the GAIL reward signal.
How exactly do GAIL and behavioral cloning differ? When do I use which?
For my current project, I’d like my agent to start training with recorded demo data exclusively, and then gradually shift to training with extrinsic rewards.
Thanks!

There used to be a doc covering GAIL and BC, but I can’t find it in the recent release. Here’s the link: ml-agents/docs/Training-Imitation-Learning.md at 0.15.0 · Unity-Technologies/ml-agents · GitHub
BC trains your agent to directly mimic the demos, so you need a lot of demos to make it work.
GAIL is more flexible in some ways. It trains a second neural network (a discriminator) to evaluate how closely the agent’s behavior matches the demos. GAIL can work well even with only a limited number of demos. Also, GAIL can work well together with an extrinsic reward, while BC doesn’t seem to do as well there.
Since you want to shift to extrinsic rewards as training goes on, I think GAIL may be the better approach here.
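Roughly, a trainer config combining them could look like this. This is only a sketch: the behavior name and demo path are placeholders, and the exact keys vary between ML-Agents versions.

```yaml
# Sketch of a PPO trainer entry combining extrinsic reward, GAIL, and BC.
# Behavior name and demo_path are placeholders.
MyBehavior:
  trainer: ppo
  # ... other PPO hyperparameters ...
  reward_signals:
    extrinsic:
      strength: 1.0                       # weight of the environment reward
      gamma: 0.99
    gail:
      strength: 0.01                      # usually kept small next to extrinsic
      gamma: 0.99
      demo_path: Demos/ExpertDemo.demo    # placeholder path to the recorded demo
  behavioral_cloning:
    demo_path: Demos/ExpertDemo.demo
    strength: 0.5                         # how strongly BC pulls actions toward the demo
    steps: 150000                         # BC influence is annealed over this many steps
```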

Thank you @YunaoShen!

The section in the docs that @YunaoShen referred to is now here: ml-agents/docs/ML-Agents-Overview.md at release_5 · Unity-Technologies/ml-agents · GitHub

If your end goal is to improve reinforcement learning training and maximize rewards, then I agree, GAIL+extrinsic reward is probably what you want, with possibly some BC training.

GAIL without an extrinsic reward should produce behavior that “acts like” the demonstration data, but this won’t necessarily maximize the environment reward.

No problem. My last project also used GAIL, so I still remember some of it.

Great, thanks! Must have missed that somehow…

Looks like once I’ve chosen my reward signals, I need to stick to them. If I start training with GAIL enabled, then I can’t pause later, comment it out in the config file and resume. Trying that gives me an error saying the configurations don’t match.
Right now, I only need GAIL for the initial training phase. I’m forcing my agent to mimic the demo without relying on extrinsic rewards. After a while, I’m reversing this: once the agent has learned the basic demo skills, it continues learning new ones using extrinsic rewards only. My current workaround here is to just swap the strength values at this point, changing GAIL 1 / extrinsic 0 to GAIL 0 / extrinsic 1.
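The two phases look roughly like this in my config (a sketch with a placeholder demo path; the exact keys depend on the ML-Agents version):

```yaml
# Phase 1: imitation only -- train against the demo, ignore the environment reward.
reward_signals:
  gail:
    strength: 1.0
    gamma: 0.99
    demo_path: Demos/ExpertDemo.demo   # placeholder path
  extrinsic:
    strength: 0.0
    gamma: 0.99
---
# Phase 2: pause training, swap the strengths, resume with the environment reward only.
reward_signals:
  gail:
    strength: 0.0                      # GAIL is still evaluated, but contributes no reward
    gamma: 0.99
    demo_path: Demos/ExpertDemo.demo
  extrinsic:
    strength: 1.0
    gamma: 0.99
```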
While this seems to work fine, it doesn’t stop GAIL from still doing its thing: I’m seeing “GAIL Expert Estimate” and “GAIL Policy Estimate” progressing in TensorBoard. “GAIL Reward” has dropped to zero though, as expected. I wonder if it would be practical to suspend any Python-side logic when its associated reward signal strength is set to zero? My naive assumption is that this might free up resources and perhaps accelerate training.

So you’re training for a while, then stopping and running with --initialize-from and the modified weights?

I think you’re right that we might be able to save a few cycles evaluating the reward signal when the weight is 0 (somewhere around here).

I also think the “right” way to solve this is to offer a GAIL pretraining option, similar to BC. I’ll log a feature request for this, but can’t promise that we’ll get to it anytime soon (especially since all this code is getting ported to pytorch as we speak)…

Internal tracked ID for GAIL pretraining is MLA-1249

I just resume training with the same model. As far as I can tell, --initialize-from would still require the same configuration anyway.

OK thanks, I’ll take a look.

Correction: Apparently I can disable GAIL after the initial training phase, if I remove its reward signal from my trainer_config.yaml as well as the configuration.yaml file in the results folder. This way, I’m not getting a configuration mismatch error and all Tensorboard graphs for GAIL just stop at this point. The model then resumes training with extrinsic rewards only as I was hoping it would.
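In case it helps anyone, the edit amounts to deleting the GAIL entry (sketch with placeholder values; the surrounding layout depends on the ML-Agents version):

```yaml
# Delete the gail block from the reward_signals section in BOTH the trainer
# config and the configuration.yaml in the results folder, then resume.
reward_signals:
  extrinsic:
    strength: 1.0
    gamma: 0.99
  # gail:                              # <- remove this entire block before resuming
  #   strength: 0.0
  #   gamma: 0.99
  #   demo_path: Demos/ExpertDemo.demo # placeholder path
```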

@celion_unity
Hi, I used BC only to pretrain my agent, and after 500,000 steps I got fairly good results. Then I switched to GAIL only, disabling BC and the extrinsic reward: I just commented out the BC section and set the extrinsic strength to 0. After a while, the mean reward went down; I tried PPO only and got the same result. I thought maybe the BC training steps weren’t enough, so I started over and retrained my pretrained model; after 1,000,000 steps, the mean reward went down again. Why did this happen? Perhaps my demonstrations aren’t good enough? I’ve already collected about 470 episodes in my game. My game is like the Match3 example, but more complex.

Hey jokerHHH, can you post a picture of your reward curve? It is possible the agent needs to unlearn a bit before going back up, especially if BC strength was high in the beginning.