Scene-Object-Direction Clues Completion and Fusion for Vision-and-Language Navigation
Abstract: Vision-and-language navigation (VLN) models that build a language graph from the language instruction can end up with invalid nodes in that graph when some clues are missing from the instruction. To address this problem, this paper designs a clues completion module (CCM) that improves the information expression ability of the invalid nodes, and a clues-weighted fusion module (CFM) that differentially fuses the three clue types (scene, object, and direction). The fused clue information is used for action prediction, yielding more accurate action scores and thereby higher navigation accuracy. Experimental results on the room-to-room (R2R) dataset show that the proposed method clearly improves the two main VLN metrics, success rate (SR) and success rate weighted by path length (SPL).
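The abstract does not specify how the CFM computes its weights, so the following is only a minimal sketch of one plausible form of weighted clue fusion, not the paper's implementation. It assumes each clue type yields one score per candidate action, that the fusion weights come from a softmax over three logits (which would be learned in a real model), and that the fused scores are normalized into action probabilities. The function name `fuse_clue_scores` and all numeric values are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_clue_scores(scene, obj, direction, weight_logits):
    """Fuse per-action scores from three clue types (hypothetical sketch).

    scene, obj, direction: lists with one score per candidate action.
    weight_logits: three logits, one per clue type; assumed learnable.
    Returns (action probabilities, normalized clue weights).
    """
    weights = softmax(weight_logits)
    fused = [
        weights[0] * s + weights[1] * o + weights[2] * d
        for s, o, d in zip(scene, obj, direction)
    ]
    return softmax(fused), weights


# Illustrative scores for three candidate actions (made-up values).
scene_scores = [0.2, 1.0, 0.1]
object_scores = [0.5, 0.9, 0.3]
direction_scores = [0.1, 0.4, 1.2]

# Equal logits: every clue type contributes equally.
action_probs, weights = fuse_clue_scores(
    scene_scores, object_scores, direction_scores, [0.0, 0.0, 0.0]
)
best_action = max(range(len(action_probs)), key=action_probs.__getitem__)

# A large direction logit lets the direction clue dominate the decision.
dir_probs, dir_weights = fuse_clue_scores(
    scene_scores, object_scores, direction_scores, [0.0, 0.0, 3.0]
)
best_action_dir = max(range(len(dir_probs)), key=dir_probs.__getitem__)
```

In this toy setting, changing the clue weights changes which candidate action receives the highest fused score, which is the kind of behavior a differentiated fusion is meant to allow.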