PMID- 37669188
OWN - NLM
STAT- PubMed-not-MEDLINE
LR  - 20230913
IS  - 1941-0042 (Electronic)
IS  - 1057-7149 (Linking)
VI  - 32
DP  - 2023
TI  - Exploring Sparse Spatial Relation in Graph Inference for Text-Based VQA.
PG  - 5060-5074
LID - 10.1109/TIP.2023.3310332 [doi]
AB  - Text-based visual question answering (TextVQA) faces the significant
      challenge of avoiding redundant relational inference. To be specific, a
      large number of detected objects and optical character recognition (OCR)
      tokens result in rich visual relationships. Existing works take all
      visual relationships into account for answer prediction. However, there
      are three observations: (1) a single subject in the images can be easily
      detected as multiple objects with distinct bounding boxes (considered
      repetitive objects). The associations between these repetitive objects
      are superfluous for answer reasoning; (2) two spatially distant OCR
      tokens detected in the image frequently have weak semantic dependencies
      for answer reasoning; and (3) the co-existence of nearby objects and
      tokens may be indicative of important visual cues for predicting
      answers. Rather than utilizing all of them for answer prediction, we
      make an effort to identify the most important connections or eliminate
      redundant ones. We propose a sparse spatial graph network (SSGN) that
      introduces a spatially aware relation pruning technique to this task. As
      spatial factors for relation measurement, we employ spatial distance,
      geometric dimension, overlap area, and DIoU for spatially aware pruning.
      We consider three visual relationships for graph learning: object-object,
      OCR-OCR token, and object-OCR token relationships. SSGN is a progressive
      graph learning architecture that verifies the pivotal relations in the
      correlated object-token sparse graph, and then in the respective
      object-based sparse graph and token-based sparse graph. Experimental
      results on the TextVQA and ST-VQA datasets demonstrate that SSGN
      achieves promising performance. Visualization results further
      demonstrate the interpretability of our method.
FAU - Zhou, Sheng
AU  - Zhou S
FAU - Guo, Dan
AU  - Guo D
FAU - Li, Jia
AU  - Li J
FAU - Yang, Xun
AU  - Yang X
FAU - Wang, Meng
AU  - Wang M
LA  - eng
PT  - Journal Article
DEP - 20230912
PL  - United States
TA  - IEEE Trans Image Process
JT  - IEEE transactions on image processing : a publication of the IEEE Signal
      Processing Society
JID - 9886191
SB  - IM
EDAT- 2023/09/05 18:41
MHDA- 2023/09/05 18:42
CRDT- 2023/09/05 12:52
PHST- 2023/09/05 18:42 [medline]
PHST- 2023/09/05 18:41 [pubmed]
PHST- 2023/09/05 12:52 [entrez]
AID - 10.1109/TIP.2023.3310332 [doi]
PST - ppublish
SO  - IEEE Trans Image Process. 2023;32:5060-5074. doi: 10.1109/TIP.2023.3310332. Epub 2023 Sep 12.
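The abstract lists DIoU among the spatial factors used for spatially aware relation pruning. For reference, here is a minimal sketch of the standard Distance-IoU computation between two axis-aligned boxes; the `(x1, y1, x2, y2)` box format is an assumption for illustration, and this is not taken from the paper's implementation:

```python
def diou(box_a, box_b):
    """Distance-IoU: IoU minus normalized squared center distance.

    Boxes are (x1, y1, x2, y2) with x1 < x2 and y1 < y2 (assumed format).
    Returns a value in (-1, 1]; 1.0 means identical boxes, negative values
    indicate distant, non-overlapping boxes.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Intersection area of the two boxes
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    # Union area, then plain IoU
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    iou = inter / union if union > 0 else 0.0

    # Squared distance between the two box centers
    rho2 = ((ax1 + ax2) / 2 - (bx1 + bx2) / 2) ** 2 \
         + ((ay1 + ay2) / 2 - (by1 + by2) / 2) ** 2

    # Squared diagonal of the smallest box enclosing both
    c2 = (max(ax2, bx2) - min(ax1, bx1)) ** 2 \
       + (max(ay2, by2) - min(ay1, by1)) ** 2

    return iou - rho2 / c2 if c2 > 0 else iou
```

A pruning step of the kind the abstract describes could then drop graph edges between object/token pairs whose DIoU falls below some threshold, keeping only spatially close, overlapping pairs; the specific threshold and combination with the other factors (spatial distance, geometric dimension, overlap area) are detailed in the paper itself.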