• CN: 11-2187/TH
  • ISSN: 0577-6686

Journal of Mechanical Engineering ›› 2024, Vol. 60 ›› Issue (10): 245-260. doi: 10.3901/JME.2024.10.245


Bayesian Inverse Reinforcement Learning-based Reward Learning for Automated Driving

ZENG Di1, ZHENG Ling1,2, LI Yinong1,2, YANG Xiantong1   

  1. College of Mechanical and Vehicle Engineering, Chongqing University, Chongqing 400044;
    2. State Key Laboratory of Mechanical Transmission, Chongqing University, Chongqing 400044
  • Received: 2023-11-20 Revised: 2024-01-25 Online: 2024-05-20 Published: 2024-07-24

Abstract: Studying driving policies with wide-ranging scenario adaptability is crucial to realizing safe, efficient, and harmonious automated driving. Deep reinforcement learning has shown great potential in driving policy learning owing to its excellent function approximation and representation capabilities. However, designing a reward function suitable for various complex driving scenarios is extremely challenging, and the generalization ability of driving policies urgently needs improvement. To address the difficulty of designing the reward function, an approximate likelihood model of human drivers' driving policy is built that accounts for their preferences, and a method is proposed for learning an approximate posterior distribution over the reward function, represented as a Bayesian neural network, through sparse action sampling based on curve interpolation and approximate variational inference. To tackle erroneous rewards arising from the uncertainty of the reward function, an uncertainty-aware human-like driving policy learning method based on the posterior distribution over the reward function is proposed, which maximizes the expected reward while penalizing its epistemic uncertainty. The proposed methods are validated in simulated highway and urban driving scenarios built from the NGSIM US-101 and nuPlan datasets. The results show that the proposed method overcomes the poor performance of reward functions based on linear combinations of hand-crafted state features, models the uncertainty of the reward function, and improves the generalization ability of the reward function to high-dimensional nonlinear problems. The learned reward function and the learning stability are significantly better than those of the mainstream inverse reinforcement learning method. Moreover, penalizing the uncertainty of the reward function improves the human likeness and safety of the driving policy as well as the training stability. The proposed uncertainty-aware human-like driving policy significantly outperforms driving policies based on behavior cloning and maximum entropy inverse reinforcement learning.
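The two core ideas of the abstract can be illustrated with a minimal sketch (not the paper's implementation): a mean-field variational Bayesian neural-network reward trained against a Boltzmann-style approximate likelihood of expert actions over a sparse candidate-action set (a stand-in for the paper's curve-interpolation action sampling), and an uncertainty-penalized reward that subtracts a multiple of the epistemic standard deviation from the posterior mean. All names, network sizes, the KL weight, and the penalty weight beta below are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F


class BayesLinear(nn.Module):
    """Mean-field variational linear layer with the reparameterization trick."""

    def __init__(self, in_features, out_features):
        super().__init__()
        self.w_mu = nn.Parameter(torch.zeros(out_features, in_features))
        self.w_rho = nn.Parameter(torch.full((out_features, in_features), -3.0))
        self.b_mu = nn.Parameter(torch.zeros(out_features))
        self.b_rho = nn.Parameter(torch.full((out_features,), -3.0))

    def forward(self, x):
        w_std, b_std = F.softplus(self.w_rho), F.softplus(self.b_rho)
        w = self.w_mu + w_std * torch.randn_like(w_std)  # sample weights from q(w)
        b = self.b_mu + b_std * torch.randn_like(b_std)
        return F.linear(x, w, b)

    def kl(self):
        # KL divergence of q(w) = N(mu, std^2) to a standard-normal prior.
        def kl_term(mu, std):
            return (0.5 * (std.pow(2) + mu.pow(2) - 1.0) - std.log()).sum()
        return kl_term(self.w_mu, F.softplus(self.w_rho)) + kl_term(self.b_mu, F.softplus(self.b_rho))


class BayesianReward(nn.Module):
    """Reward network r(s, a); epistemic uncertainty via Monte-Carlo weight samples."""

    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.l1 = BayesLinear(state_dim + action_dim, hidden)
        self.l2 = BayesLinear(hidden, 1)

    def forward(self, s, a):
        return self.l2(torch.relu(self.l1(torch.cat([s, a], dim=-1)))).squeeze(-1)

    def kl(self):
        return self.l1.kl() + self.l2.kl()

    def penalized_reward(self, s, a, n_samples=20, beta=1.0):
        # Uncertainty-aware reward: posterior mean minus a penalty on the epistemic
        # standard deviation ("maximize reward while penalizing epistemic uncertainty").
        samples = torch.stack([self(s, a) for _ in range(n_samples)], dim=0)
        return samples.mean(0) - beta * samples.std(0)


def variational_reward_loss(reward_net, s, expert_a, candidate_actions, kl_weight=1e-3):
    # Negative ELBO: a Boltzmann-style approximate likelihood of the expert action over a
    # sparse candidate-action set (stand-in for curve-interpolation sampling), plus the KL term.
    r_expert = reward_net(s, expert_a)                                      # (B,)
    r_cand = torch.stack([reward_net(s, a) for a in candidate_actions], 1)  # (B, K)
    log_lik = r_expert - torch.logsumexp(torch.cat([r_expert.unsqueeze(1), r_cand], dim=1), dim=1)
    return -log_lik.mean() + kl_weight * reward_net.kl()

In policy learning, penalized_reward would replace the raw reward signal, so the agent is discouraged from exploiting state-action regions where the reward posterior is uncertain.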

Key words: intelligent vehicle, automated driving, approximate variational reward learning, approximate variational inference, Bayesian inverse reinforcement learning
