Visual Content Interpretation Method Based on Superframe Cutting of Long Videos
Graphical Abstract
Abstract
Existing encoder-decoder video captioning methods suffer from insufficient visual feature extraction for long videos and an insufficient ability to capture key clips when video scenes switch frequently. To address these problems, a video captioning method based on superframe cutting of long videos was proposed. First, a superframe extraction algorithm was proposed to meet the limitation on video browsing time and shorten retrieval time by calculating the ratio of key video time. A two-layer screening model was constructed to adaptively extract superframes, filter redundant keyframes, and perform multi-scene semantic description; the retained keyframes were embedded together with their surrounding frames. The deep network model adopted small convolution kernels and small pooling sampling regions to obtain more visual features from the video, overcoming the difficulty that classical video captioning methods cannot process long videos directly. Finally, a long short-term memory (LSTM) model, rather than an ordinary recurrent neural network, was adopted to decode the piecewise video information and generate video captions. The method was tested on YouTube dataset videos, synthetic videos, and long surveillance videos, and its performance was evaluated with a variety of machine translation metrics, on which it obtained improvements of varying degrees. Results show that the method can produce better descriptions of video segments when dealing with challenges such as frequent scene switching and long videos.
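As a rough illustration of the superframe-cutting idea described above, the sketch below greedily keeps the segments with the highest score per frame until a browsing-time budget is reached, then reports the resulting ratio of key video time. The segment representation, scoring, and greedy selection strategy are assumptions made for illustration only, not the paper's actual algorithm.

```python
# Minimal sketch of superframe selection under a browsing-time budget.
# All names (Segment, select_superframes) and the score-density heuristic
# are hypothetical; they stand in for the paper's superframe extraction step.
from dataclasses import dataclass


@dataclass
class Segment:
    start: int      # first frame index of the superframe segment
    end: int        # last frame index (inclusive)
    score: float    # assumed visual "interestingness" of the segment

    @property
    def length(self) -> int:
        return self.end - self.start + 1


def select_superframes(segments: list[Segment], fps: float,
                       budget_seconds: float) -> list[Segment]:
    """Greedily keep the highest score-per-frame segments until the
    total duration of kept segments reaches the browsing-time budget."""
    budget_frames = int(budget_seconds * fps)
    chosen: list[Segment] = []
    used_frames = 0
    # Rank by score density so short, highly informative segments win.
    for seg in sorted(segments, key=lambda s: s.score / s.length, reverse=True):
        if used_frames + seg.length <= budget_frames:
            chosen.append(seg)
            used_frames += seg.length
    return sorted(chosen, key=lambda s: s.start)


if __name__ == "__main__":
    segs = [Segment(0, 149, 3.2), Segment(150, 899, 1.1), Segment(900, 1049, 4.0)]
    kept = select_superframes(segs, fps=30.0, budget_seconds=8.0)
    # Ratio of key video time: kept frames over total frames.
    ratio = sum(s.length for s in kept) / sum(s.length for s in segs)
    print(f"kept {len(kept)} segment(s), key-time ratio = {ratio:.2f}")
```

In this toy run, a 30 fps video with an 8-second browsing budget keeps only the densest segment, yielding a small key-time ratio; the downstream screening and captioning stages would then operate on the kept segments rather than the full video.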