Abstract
Current image-text matching methods implicitly align visual-semantic segments within images and employ cross-modal attention mechanisms to discover fine-grained cross-modal semantic correspondences. Although region-word pairs constitute local matches across modalities, they may yield inaccurate relevance measurements when viewed from the global perspective of the image-text relationship. Additionally, cross-modal attention mechanisms may introduce redundant or irrelevant region-word alignments, which reduce retrieval accuracy and limit efficiency. To address these challenges, we propose a Dual-perception Attention and local-global Similarity Fusion (DASF) framework. Specifically, we combine two types of similarity matching, global and local, to establish a more accurate correspondence between images and text by considering both global semantics and local details during matching. In addition, we integrate a dual-perception attention mechanism to learn the relationship between images and text, using attention polarity to determine the degree of matching and to better capture contextual and semantic information, thereby reducing interference from irrelevant regions. Extensive experiments on two benchmark datasets, Flickr30K and MSCOCO, demonstrate the effectiveness of DASF, which achieves state-of-the-art performance.