Abstract:
The collaborative analysis and processing of cross-modal data has long been a challenging and active topic in modern artificial intelligence, with the semantic and heterogeneity gaps between modalities posing the main difficulty. Recently, driven by rapid advances in deep learning theory and technology, deep-learning-based algorithms have made great progress in image and text processing, giving rise to the research topic of visual question answering (VQA). A VQA system takes visual information and a textual question as input and produces a corresponding answer; the core of such a system is the joint understanding and processing of visual and textual information. This paper therefore reviews VQA methods in detail. According to their underlying principles, existing VQA methods are divided into three categories: data fusion, cross-modal attention, and knowledge reasoning. The latest developments of VQA methods are comprehensively summarized and analyzed, commonly used VQA datasets are introduced, and promising directions for future research are suggested.
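To make the input/output contract of a VQA system concrete, the following is a minimal sketch of the "data fusion" family mentioned above: image features and question features are projected into a common space, fused element-wise, and scored against a fixed answer vocabulary. All dimensions, the random weights, and the multiplicative fusion operator are illustrative assumptions, not taken from any specific method surveyed here.

```python
import numpy as np

rng = np.random.default_rng(0)

IMG_DIM, TXT_DIM, HIDDEN, NUM_ANSWERS = 2048, 300, 512, 10

# Stand-ins for features from a CNN image encoder and an RNN question
# encoder (random vectors here, purely for illustration).
image_feat = rng.standard_normal(IMG_DIM)
question_feat = rng.standard_normal(TXT_DIM)

# Learned projection matrices (randomly initialized for this sketch).
W_img = rng.standard_normal((HIDDEN, IMG_DIM)) * 0.01
W_txt = rng.standard_normal((HIDDEN, TXT_DIM)) * 0.01
W_out = rng.standard_normal((NUM_ANSWERS, HIDDEN)) * 0.01

def fuse_and_answer(img, txt):
    """Project both modalities, fuse by element-wise product, classify."""
    h_img = np.tanh(W_img @ img)
    h_txt = np.tanh(W_txt @ txt)
    fused = h_img * h_txt              # simple multiplicative fusion
    logits = W_out @ fused
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()             # softmax over candidate answers

probs = fuse_and_answer(image_feat, question_feat)
print(probs.shape)                     # one probability per candidate answer
```

In practice the projections would be trained end-to-end, and the surveyed cross-modal attention and knowledge reasoning methods replace the simple fusion step with richer interaction mechanisms.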