Visual Attentional Network and Learning Method for Object Search and Recognition

doi:10.3901/JME.2019.11.123

Abstract

A recurrent visual network is proposed to search for and recognize an object simultaneously. The network automatically selects a sequence of local observations and accurately localizes and recognizes objects by fusing detailed local appearance with coarse contextual visual information. It is more efficient than methods that apply sliding windows or convolution over the whole image. In addition, a hybrid loss function is proposed to learn the parameters of the multi-task network end-to-end. In particular, a combination of stochastic and object-aware strategies is introduced into the visual fixation loss, which helps the network exploit richer context and move the fixation point close to the object as quickly as possible. A real-world dataset is built to verify the method's ability to search for and recognize objects of interest, including small ones. Experiments show that the method predicts an accurate bounding box for a visual object and achieves a higher search speed. The source code will be released so that the method can be verified and analyzed.
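
As a rough illustration of how a hybrid loss with a stochastic, object-aware fixation term might be assembled, the following sketch combines a classification term, a bounding-box term, and a REINFORCE-style fixation term. It is not the authors' implementation: the Gaussian fixation policy, the exponential distance reward, the smooth L1 box loss, the equal weights, and every function name are assumptions made for illustration. The code only evaluates the loss value; in practice an automatic-differentiation framework would supply the gradients.

```python
import math
import random


def sample_fixation(mean, sigma, rng):
    """Stochastic part (assumed): draw a fixation from a Gaussian around the predicted point."""
    return tuple(rng.gauss(m, sigma) for m in mean)


def object_aware_reward(fix, box):
    """Object-aware part (assumed): 1.0 inside the box, decaying with distance outside it."""
    x, y = fix
    x0, y0, x1, y1 = box
    dx = max(x0 - x, 0.0, x - x1)  # horizontal distance to the box, 0 if inside
    dy = max(y0 - y, 0.0, y - y1)  # vertical distance to the box, 0 if inside
    return math.exp(-math.hypot(dx, dy))


def gaussian_log_prob(fix, mean, sigma):
    """Log-density of an isotropic Gaussian fixation policy."""
    norm = math.log(sigma * math.sqrt(2.0 * math.pi))
    return sum(-0.5 * ((f - m) / sigma) ** 2 - norm for f, m in zip(fix, mean))


def smooth_l1(pred, target):
    """Smooth L1 distance summed over box coordinates."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d if d < 1.0 else d - 0.5
    return total


def hybrid_loss(class_probs, label, pred_box, gt_box, fix, mean, sigma,
                weights=(1.0, 1.0, 1.0)):
    """Weighted sum of classification, box, and fixation terms (weights are hypothetical)."""
    w_cls, w_box, w_fix = weights
    cls_term = -math.log(class_probs[label])        # cross-entropy
    box_term = smooth_l1(pred_box, gt_box)          # localization
    reward = object_aware_reward(fix, gt_box)
    # REINFORCE surrogate: minimizing it raises the probability of rewarded fixations.
    fix_term = -reward * gaussian_log_prob(fix, mean, sigma)
    return w_cls * cls_term + w_box * box_term + w_fix * fix_term


if __name__ == "__main__":
    rng = random.Random(0)
    gt_box = (0.4, 0.4, 0.6, 0.6)   # normalized (x0, y0, x1, y1)
    mean = (0.5, 0.5)               # predicted fixation location
    fix = sample_fixation(mean, 0.1, rng)
    loss = hybrid_loss([0.2, 0.8], 1, (0.38, 0.41, 0.63, 0.6), gt_box, fix, mean, 0.1)
    print(fix, loss)
```

In this sketch the random sampling lets the fixation explore the surrounding context, while the reward favors fixations that land on or near the object, which mirrors the two roles the abstract assigns to the fixation loss.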

Key words: attentional model, fixation strategy, object detection, reinforcement learning

CLC Number: TG156

LÜ Jie, LUO Fangying, YUAN Zejian. Visual Attentional Network and Learning Method for Object Search and Recognition[J]. Journal of Mechanical Engineering, 2019, 55(11): 123-130.
