• CN: 11-2187/TH
  • ISSN: 0577-6686

Journal of Mechanical Engineering ›› 2024, Vol. 60 ›› Issue (10): 245-260. doi: 10.3901/JME.2024.10.245

• Intelligent Decision-making and Planning •


Bayesian Inverse Reinforcement Learning-based Reward Learning for Automated Driving

ZENG Di1, ZHENG Ling1,2, LI Yinong1,2, YANG Xiantong1

  1. College of Mechanical and Vehicle Engineering, Chongqing University, Chongqing 400044;
  2. State Key Laboratory of Mechanical Transmission for Advanced Equipment, Chongqing University, Chongqing 400044
  • Received: 2023-11-20  Revised: 2024-01-25  Online: 2024-05-20  Published: 2024-07-24
  • About the authors: ZENG Di, male, born in 1996, is a PhD candidate. His research interests include decision-making and planning for automated driving.
    E-mail: cqumichael@163.com
    ZHENG Ling (corresponding author), female, born in 1963, PhD, is a professor and doctoral supervisor. Her research interests include perception, decision-making, and control theory and methods for intelligent vehicles.
    E-mail: zling@cqu.edu.cn
    LI Yinong, male, born in 1961, PhD, is a professor and doctoral supervisor. His research interests include vehicle system dynamics and control, and motion control of intelligent vehicles.
    E-mail: ynli@cqu.edu.cn
    YANG Xiantong, male, born in 1998, is a PhD candidate. His research interests include state estimation and dynamics control of intelligent vehicles.
    E-mail: xiantongyang@cqu.edu.cn
  • Funding: Supported by the National Natural Science Foundation of China (51875061) and the Fundamental Research Funds for the Central Universities (2023CDJXY-021).


Abstract: Studying driving policies with broad scenario adaptability is crucial to realizing safe, comfortable, and harmonious automated driving. Deep reinforcement learning has shown great potential for driving policy learning owing to its excellent function approximation and representation capabilities. However, designing a reward function suitable for diverse, complex driving scenarios is extremely challenging, and the scenario generalization ability of driving policies urgently needs to be improved. To address the difficulty of reward design in complex driving scenarios, an approximate likelihood model of human driving policies is built that accounts for human driving-behavior preferences, and an approximate posterior distribution over the reward function is learned through curve-interpolation-based sparse sampling of the action space and approximate variational inference, yielding a reward model based on a Bayesian neural network. To handle erroneous rewards arising from the uncertainty of the neural-network reward function, the uncertainty of the Bayesian neural network reward is quantified with a Monte Carlo method, and an uncertainty-aware human-like driving policy training method based on the posterior distribution over the reward function is proposed, which maximizes the reward while appropriately penalizing the epistemic uncertainty. The proposed methods are tested and validated on the NGSIM US-101 highway dataset and the nuPlan urban road dataset. The results show that the approximate variational reward learning method based on Bayesian inverse reinforcement learning overcomes the poor performance of reward functions built as linear combinations of hand-crafted state features, quantifies the uncertainty of the reward function, and improves its generalization to high-dimensional nonlinear problems; the learned reward function and the training stability are clearly superior to those of mainstream inverse reinforcement learning methods. Moreover, appropriately penalizing uncertainty in the reward function improves the human likeness and safety of the driving policy as well as the stability of its training; the proposed uncertainty-aware human-like driving policy significantly outperforms policies learned by behavior cloning and by maximum entropy inverse reinforcement learning.
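The reward-learning step described in the abstract can be sketched in standard variational-inference notation; the symbols below are illustrative and not taken from the paper. Assuming expert demonstrations $\mathcal{D}=\{(s_i,a_i)\}$ and a reward network $r_\theta(s,a)$ with prior $p(\theta)$, an approximate posterior $q_\phi(\theta)$ is obtained by maximizing an evidence lower bound of the form

$$\mathcal{L}(\phi)=\mathbb{E}_{q_\phi(\theta)}\!\left[\sum_i \log p(a_i \mid s_i,\theta)\right]-\mathrm{KL}\!\left(q_\phi(\theta)\,\|\,p(\theta)\right),$$

where one common preference-based likelihood consistent with the abstract is a Boltzmann model over a sparsely sampled candidate action set $\mathcal{A}_s(s_i)$ (here assumed to be generated by curve interpolation):

$$p(a_i \mid s_i,\theta)=\frac{\exp r_\theta(s_i,a_i)}{\sum_{a'\in\mathcal{A}_s(s_i)} \exp r_\theta(s_i,a')}.$$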

Key words: intelligent vehicle, automated driving, approximate variational reward learning, approximate variational inference, Bayesian inverse reinforcement learning
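The uncertainty-penalized objective described in the abstract can also be illustrated with a minimal Python sketch. This is not the authors' implementation: the sampler sample_reward_fn (standing in for draws from the learned posterior over the Bayesian neural network reward), the number of Monte Carlo samples, and the penalty weight beta are all assumptions.

import numpy as np

def uncertainty_penalized_reward(features, sample_reward_fn, n_samples=30, beta=1.0):
    """Monte Carlo estimate of E[r] - beta * Std[r] under the reward posterior.

    features         : state-action feature vector fed to the reward model
    sample_reward_fn : returns one reward value per call, using a fresh weight
                       sample from the (approximate) posterior
    beta             : weight of the epistemic-uncertainty penalty
    """
    samples = np.array([sample_reward_fn(features) for _ in range(n_samples)])
    return samples.mean() - beta * samples.std()

# Toy usage with a stand-in stochastic reward model (placeholder for the
# Bayesian neural network reward learned by approximate variational inference).
rng = np.random.default_rng(0)

def toy_sampled_reward(features):
    weights = rng.normal(loc=1.0, scale=0.2, size=features.shape)  # one weight sample per call
    return float(weights @ features)

state_action = np.array([0.5, -0.2, 1.0])  # example state-action features
print(uncertainty_penalized_reward(state_action, toy_sampled_reward))

Maximizing this penalized reward during policy training discourages the policy from exploiting state-action regions where the learned reward is epistemically uncertain.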

CLC number: