Abstract:
To address the insufficient detection of dynamic gesture key frames and hand contour features in two-stream fusion networks, this paper proposes a dynamic gesture recognition method based on the fusion of spatial-temporal features and channel attention. First, efficient channel attention (ECA) is introduced into the two-stream fusion network to strengthen attention to gesture key frames, and the spatial and temporal convolutional networks of the two-stream architecture are used to extract the spatial and temporal features of dynamic gestures. Second, the gesture frame with the highest attention in the spatial network is selected by ECA, and a single shot multibox detector (SSD) is used to extract hand contour features. Finally, the hand contour features are fused with the body posture features and with the temporal features extracted by the two-stream network to recognize gestures. The proposed method is evaluated on the Chalearn 2013 multi-modal sign language recognition dataset, where it reaches an accuracy of 66.23%. Compared with previous two-stream methods that use only the RGB information from this dataset, it achieves better gesture recognition performance.
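The key-frame selection step can be illustrated with a minimal sketch of ECA-style weighting. This is not the authors' implementation. It assumes that each gesture frame is treated as one channel, that each frame is a small 2D feature map, and that the 1D convolution kernel (learned in a real network) is a fixed, odd-length, hypothetical list of values. The function names `eca_weights`, `global_average_pool`, and `select_key_frame` are illustrative only.

```python
import math


def global_average_pool(frame):
    """Collapse a 2D feature map (list of rows) to its mean value."""
    values = [v for row in frame for v in row]
    return sum(values) / len(values)


def eca_weights(channel_means, kernel):
    """ECA-style weights: 1D convolution across neighbouring channels,
    followed by a sigmoid. Assumes an odd-length kernel and zero padding."""
    k = len(kernel)
    pad = k // 2
    padded = [0.0] * pad + list(channel_means) + [0.0] * pad
    weights = []
    for i in range(len(channel_means)):
        z = sum(kernel[j] * padded[i + j] for j in range(k))
        weights.append(1.0 / (1.0 + math.exp(-z)))
    return weights


def select_key_frame(frames, kernel):
    """Return the index of the frame with the highest attention weight,
    together with all weights (frames are treated as channels here)."""
    means = [global_average_pool(f) for f in frames]
    weights = eca_weights(means, kernel)
    best = max(range(len(weights)), key=weights.__getitem__)
    return best, weights


# Hypothetical toy input: three 2x2 "frames" and a fixed smoothing kernel.
frames = [
    [[0.1, 0.1], [0.1, 0.1]],
    [[0.9, 0.9], [0.9, 0.9]],
    [[0.2, 0.2], [0.2, 0.2]],
]
key_index, attention = select_key_frame(frames, [0.25, 0.5, 0.25])
```

In the method described above, the selected frame would then be passed to the SSD detector. That stage is omitted here because the abstract does not specify its configuration.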