    WEI Yingzi, LIU Wangjie. Visual Content Interpretation Method for Superframe Cutting of Long Videos[J]. Journal of Beijing University of Technology, 2024, 50(7): 805-813. DOI: 10.11936/bjutxb2022090009

    Visual Content Interpretation Method for Superframe Cutting of Long Videos


      Abstract: Existing encoder-decoder video description methods perform poorly on long videos with frequent scene switching, owing to insufficient visual feature extraction or an insufficient ability to capture key clips. To address these problems, a video captioning method based on superframe cutting of long videos was proposed. First, a superframe extraction algorithm was proposed that computes the time-occupancy ratio of key video segments, satisfying the limit on video browsing time and shortening retrieval time. A two-layer screening model was then constructed to adaptively extract superframes, filter redundant keyframes, and support multi-scene semantic description. The retained keyframes were embedded with their surrounding frames, and a deep network with small convolution kernels and pooling sampling domains was used to obtain richer video features, overcoming the difficulty that classical video captioning methods cannot directly process long videos. Finally, a long short-term memory (LSTM) model was used in place of a plain recurrent neural network to decode and generate video captions, yielding piecewise interpretations of the video content. The method was tested on YouTube dataset videos, synthetic videos, and long surveillance videos, and its performance was evaluated with several machine translation metrics, all of which improved to varying degrees. Experimental results show that the method produces better segment descriptions when facing challenges such as long videos and frequent scene switching.
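
      The abstract does not include code, but the superframe-selection step can be illustrated. The following is a minimal sketch assuming per-frame importance scores are already available (e.g., from motion or attention cues); the function name superframe_extract, its parameters, and the one-second redundancy window are hypothetical choices for illustration, not the authors' published algorithm.

        import numpy as np

        def superframe_extract(scores, fps, max_ratio=0.15):
            """Select keyframe indices under a key-time occupancy budget.

            scores:    per-frame importance scores, shape (T,)
            fps:       frames per second of the source video
            max_ratio: cap on the fraction of total video time retained
            """
            T = len(scores)
            budget = max(1, int(T * max_ratio))   # frame budget from the time-ratio limit
            order = np.argsort(scores)[::-1]      # frames ranked by descending importance

            # Layer 1: keep the highest-scoring frames within the time budget.
            candidates = set(order[:budget].tolist())

            # Layer 2: drop near-duplicate keyframes closer than one second apart
            # (a crude stand-in for the paper's adaptive redundancy filtering).
            window = max(1, int(fps))
            kept = []
            for idx in sorted(candidates):
                if not kept or idx - kept[-1] >= window:
                    kept.append(idx)
            return kept

        # Example: 100 s of 30 fps video with synthetic scores.
        scores = np.random.rand(3000)
        print(superframe_extract(scores, fps=30)[:10])

      Here the "two layers" are read as (1) a budget cut driven by the key-time occupancy ratio and (2) a temporal redundancy filter; the paper's actual screening model is adaptive and may differ in both criteria.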

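      Likewise, the decoding stage, which replaces a plain recurrent neural network with an LSTM, can be sketched. The PyTorch module below is a hypothetical illustration: the class name CaptionDecoder, the feature dimensions, and the teacher-forcing interface are assumptions, not the paper's architecture.

        import torch
        import torch.nn as nn

        class CaptionDecoder(nn.Module):
            """Hypothetical LSTM caption decoder conditioned on clip features."""

            def __init__(self, feat_dim=2048, vocab_size=10000,
                         embed_dim=512, hidden_dim=512):
                super().__init__()
                self.embed = nn.Embedding(vocab_size, embed_dim)
                self.init_h = nn.Linear(feat_dim, hidden_dim)  # features -> initial hidden state
                self.init_c = nn.Linear(feat_dim, hidden_dim)  # features -> initial cell state
                self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
                self.out = nn.Linear(hidden_dim, vocab_size)

            def forward(self, clip_feat, tokens):
                # clip_feat: (B, feat_dim) pooled CNN features of one superframe segment
                # tokens:    (B, L) caption token ids (teacher forcing)
                h0 = self.init_h(clip_feat).unsqueeze(0)       # (1, B, hidden_dim)
                c0 = self.init_c(clip_feat).unsqueeze(0)
                emb = self.embed(tokens)                       # (B, L, embed_dim)
                hidden, _ = self.lstm(emb, (h0, c0))           # (B, L, hidden_dim)
                return self.out(hidden)                        # (B, L, vocab_size) logits

        # Example forward pass with random inputs.
        dec = CaptionDecoder()
        logits = dec(torch.randn(2, 2048), torch.randint(0, 10000, (2, 12)))

      Conditioning the initial LSTM state on pooled segment features is one common way to generate a separate caption per superframe segment, which matches the paper's goal of piecewise interpretation of long videos.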