Scene Graph Generation Method Based on Dual-stream Multi-head Attention
Abstract: To address the limited contextual information captured by existing scene graph generation models, an effective context fusion module, the dual-stream multi-head attention (DMA) module, is proposed. DMA is applied to both the object classification stage and the relationship classification stage, yielding the dual-stream multi-head attention-based scene graph generation network (DMA-Net). The network consists of three modules: object detection, object semantic parsing, and relationship semantic parsing. First, the object detection module locates the objects in the image and extracts their features. Second, the object dual-stream multi-head attention (O-DMA) in the object semantic parsing module produces features fused with node context, which the object semantic decoder decodes into object category labels. Finally, the relationship dual-stream multi-head attention (R-DMA) in the relationship semantic parsing module outputs features fused with edge context, which the relationship semantic decoder decodes into relationship category labels. On the publicly available Visual Genome (VG) dataset, the graph-constraint recall and no-graph-constraint recall of DMA-Net are computed for three subtasks (scene graph detection, scene graph classification, and predicate classification) and compared with those of mainstream scene graph generation methods. The experimental results show that the proposed method can fully exploit the contextual information in the scene, and that the context-enhanced feature representation effectively improves the accuracy of scene graph generation.
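The difference between the two recall variants named in the abstract can be illustrated with a minimal sketch. It is not the authors' evaluation code. It assumes the predicate classification setting, in which object boxes and labels are given, so a predicted triplet matches a ground-truth triplet when its (subject index, object index, predicate) agree. The function name `recall_at_k`, its parameters, and the toy scores are hypothetical. With the graph constraint, each subject-object pair contributes only its highest-scoring predicate; without it, every predicate of every pair competes for the top K slots.

```python
import numpy as np


def recall_at_k(pair_idx, pred_scores, gt_triplets, k, graph_constraint=True):
    """Recall@K over (subject, object, predicate) triplets for one image.

    pair_idx:     (P, 2) integer array of (subject, object) indices.
    pred_scores:  (P, R) predicate scores; column 0 is assumed to be background.
    gt_triplets:  non-empty set of (subject, object, predicate) tuples.
    """
    if not gt_triplets:
        raise ValueError("gt_triplets must not be empty")
    pair_idx = np.asarray(pair_idx)
    scores = np.asarray(pred_scores, dtype=float)[:, 1:]  # drop background column

    if graph_constraint:
        # One candidate per pair: its highest-scoring predicate.
        cand_pair = np.arange(scores.shape[0])
        cand_col = scores.argmax(axis=1)
    else:
        # Every (pair, predicate) combination is a candidate.
        cand_pair, cand_col = np.indices(scores.shape).reshape(2, -1)
    cand_score = scores[cand_pair, cand_col]

    # Keep the K highest-scoring candidates.
    top = np.argsort(-cand_score, kind="stable")[:k]
    predicted = {
        (int(pair_idx[p, 0]), int(pair_idx[p, 1]), int(c) + 1)  # +1 restores class id
        for p, c in zip(cand_pair[top], cand_col[top])
    }
    return len(predicted & set(gt_triplets)) / len(gt_triplets)


# Toy example: two candidate pairs, two predicate classes plus background.
pairs = [[0, 1], [1, 0]]
scores = [[0.1, 0.6, 0.5],
          [0.1, 0.2, 0.1]]
gt = {(0, 1, 1), (0, 1, 2)}  # the same pair carries two ground-truth predicates

print(recall_at_k(pairs, scores, gt, k=2, graph_constraint=True))   # 0.5
print(recall_at_k(pairs, scores, gt, k=2, graph_constraint=False))  # 1.0
```

In the toy example the pair (0, 1) has two ground-truth predicates. The graph constraint allows only one prediction for that pair, so at most half of the ground truth can be recalled, whereas the unconstrained ranking recovers both.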