Abstract:
To address the inaccurate value estimation caused by underestimation of Q-values in the twin delayed deep deterministic policy gradient (TD3) algorithm, which degrades learning performance in quadruped robot skill learning, a randomized ensembled network-TD3 (RE-TD3) algorithm is proposed. First, the algorithm ensembles multiple Q-value networks and randomly selects a subset of them for evaluation, which alleviates the inaccurate value estimation and improves policy performance. Second, appropriate reward functions are designed to guide the gait learning task of the quadruped robot. Finally, simulation experiments are conducted to validate the proposed algorithm. The results show that the quadruped robot learns good gaits with the RE-TD3 algorithm; compared with the TD3 algorithm, the reward value increases by 32%, body stability improves by approximately 67%, and the offset from the expected direction improves by 60%.
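
The value-evaluation step described above can be illustrated with a minimal sketch. The abstract does not give the ensemble size, the number of networks sampled, or how the sampled values are combined, so everything here is an assumption: the function name `randomized_ensemble_target`, the parameters `subset_size` and `gamma`, and the choice to take the minimum over the sampled networks (borrowed from TD3's clipped double-Q target) are illustrative and may differ from the RE-TD3 algorithm itself.

```python
import numpy as np


def randomized_ensemble_target(q_next, reward, done, subset_size=2,
                               gamma=0.99, rng=None):
    """Bootstrapped target from a random subset of an ensemble of Q-values.

    q_next : array of shape (n_networks, batch), the target-critic values
             at the next state-action pair (hypothetical input layout).
    reward : array of shape (batch,)
    done   : array of shape (batch,), 1.0 where the episode ended.
    """
    rng = np.random.default_rng() if rng is None else rng
    q_next = np.asarray(q_next, dtype=float)
    n_networks = q_next.shape[0]

    # Randomly pick which critics take part in this evaluation.
    idx = rng.choice(n_networks, size=subset_size, replace=False)

    # Assumption: combine the sampled critics with a minimum, as TD3 does
    # for its two critics. Other aggregations are possible.
    q_min = q_next[idx].min(axis=0)

    # Standard one-step target; terminal transitions do not bootstrap.
    return np.asarray(reward) + gamma * (1.0 - np.asarray(done)) * q_min
```

In this sketch, sampling a small subset from a larger ensemble makes the minimum less pessimistic on average than a minimum over all networks, which is one plausible way to temper underestimation.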