claim
The two-stage reinforcement learning process utilizes a simpler reward model to narrow the search space, effectively guiding the policy toward a subset of optimal solutions that offline cloning cannot easily identify.

Authors

Sources

Referenced by nodes (1)