Citation: WEI Yingzi, LIU Wangjie. Visual Content Interpretation Method for Superframe Cutting of Long Videos[J]. Journal of Beijing University of Technology, 2024, 50(7): 805-813. DOI: 10.11936/bjutxb2022090009
To address the problems of existing encoder-decoder video captioning methods, such as insufficient visual feature extraction for long videos and insufficient ability to capture key clips when video scenes switch frequently, a video captioning method based on superframe cutting of long videos was proposed. First, a superframe extraction algorithm was proposed to meet the limitation on video browsing time and to shorten retrieval time by calculating the ratio of key video time. A two-layer screening model was constructed to adaptively extract superframes, filter redundant keyframes, and support multi-scene semantic description; the retained keyframes were embedded in their surrounding frames. The deep network model adopted small convolution kernels, combined with the pooling sampling domain, to obtain more visual features of the video, overcoming the difficulty that classical video captioning methods cannot directly process long videos. Finally, a long short-term memory (LSTM) model was adopted, in place of a plain recurrent neural network, to decode and generate video captions for the piecewise interpretation of video information. The method was tested on YouTube dataset videos, synthetic videos, and long surveillance videos, and a variety of machine translation evaluation metrics were used to assess its performance, showing improvements of varying degrees. Results show that this method can produce better descriptions of video segments when dealing with challenges such as frequent scene switching and long videos.
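The abstract describes the superframe extraction and two-layer screening only at a high level. The sketch below is an illustrative reconstruction under assumed details (a motion-based frame score, a browsing-time budget ratio, and a histogram-similarity threshold), not the authors' published algorithm: the first layer ranks frames by inter-frame difference and keeps only the fraction allowed by the time budget, and the second layer filters redundant keyframes whose appearance is too similar to the previously retained one.

```python
import cv2
import numpy as np

def superframe_keyframes(video_path, time_budget_ratio=0.1, sim_thresh=0.9):
    """Illustrative two-layer screening (not the paper's exact algorithm).

    Layer 1: rank frames by inter-frame difference and keep the most
    dynamic fraction, approximating a "key video time" budget.
    Layer 2: drop keyframes whose histograms are too similar to the
    previously retained keyframe (redundancy filtering).
    """
    cap = cv2.VideoCapture(video_path)
    frames, scores = [], []
    prev_gray = None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Layer 1 score: mean absolute inter-frame difference (0 for frame 0).
        scores.append(0.0 if prev_gray is None
                      else float(np.mean(cv2.absdiff(gray, prev_gray))))
        frames.append(frame)
        prev_gray = gray
    cap.release()

    # Keep the top-scoring fraction allowed by the browsing-time budget,
    # restored to temporal order.
    n_keep = max(1, int(len(frames) * time_budget_ratio))
    candidates = sorted(np.argsort(scores)[-n_keep:])

    # Layer 2: filter redundant keyframes by cosine similarity of
    # L2-normalized intensity histograms.
    kept, last_hist = [], None
    for idx in candidates:
        hist = cv2.calcHist([frames[idx]], [0], None, [64], [0, 256])
        hist = cv2.normalize(hist, hist).flatten()
        if last_hist is None or float(np.dot(hist, last_hist)) < sim_thresh:
            kept.append(idx)
            last_hist = hist
    return kept
```

The budget ratio plays the role of the abstract's "ratio of key video time"; in the paper this quantity is computed adaptively, whereas here it is a fixed parameter for simplicity.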
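The decoding stage uses an LSTM to generate captions from the extracted visual features. A minimal PyTorch sketch of such a decoder is shown below; the layer sizes and the use of pooled segment features to initialize the LSTM state are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    """Minimal LSTM caption decoder (a sketch, not the paper's model).

    Visual features from the CNN initialize the LSTM state; words are
    predicted one step at a time from embeddings of the previous tokens.
    """
    def __init__(self, feat_dim, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.init_h = nn.Linear(feat_dim, hidden_dim)
        self.init_c = nn.Linear(feat_dim, hidden_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, feats, captions):
        # feats: (B, feat_dim) pooled segment features; captions: (B, T) token ids.
        h0 = torch.tanh(self.init_h(feats)).unsqueeze(0)  # (1, B, H)
        c0 = torch.tanh(self.init_c(feats)).unsqueeze(0)
        emb = self.embed(captions)                        # (B, T, E)
        out, _ = self.lstm(emb, (h0, c0))                 # (B, T, H)
        return self.out(out)                              # (B, T, vocab)
```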
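For the evaluation side, the abstract mentions a variety of machine translation metrics (the usual choices for captioning are BLEU, METEOR, ROUGE, and CIDEr). As a small usage example, assuming NLTK is installed and its WordNet data downloaded, sentence-level BLEU and METEOR can be computed as follows; the captions here are invented placeholders.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score
# METEOR requires: import nltk; nltk.download("wordnet")

# Pre-tokenized reference captions and a generated hypothesis (placeholders).
references = [["a", "man", "is", "riding", "a", "horse"],
              ["someone", "rides", "a", "horse"]]
hypothesis = ["a", "man", "rides", "a", "horse"]

bleu = sentence_bleu(references, hypothesis,
                     smoothing_function=SmoothingFunction().method1)
meteor = meteor_score(references, hypothesis)
print(f"BLEU-4: {bleu:.3f}  METEOR: {meteor:.3f}")
```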