

Poster

Textual Grounding for Open-vocabulary Visual Information Extraction in Layout-diversified Documents

Mengjun Cheng · Chengquan Zhang · Chang Liu · Yuke Li · Bohan Li · Kun Yao · Xiawu Zheng · Rongrong Ji · Jie Chen

Strong Double Blind review: this paper was not made available on public preprint services during the review process.
Tue 1 Oct 7:30 a.m. PDT — 9:30 a.m. PDT

Abstract:

Current methodologies have achieved notable success on the closed-set visual information extraction (VIE) task, but the open-vocabulary setting, which is more practical for individual users who need to extract information from documents of diverse types, remains comparatively underexplored. Existing solutions, including NER-based and LLM-based methods, fall short: they cannot handle the unlimited range of open-vocabulary keys and lack explicit layout modeling. This paper introduces a novel method that tackles this challenge by reformulating the categorization of text tokens as the task of locating regions conditioned on given key queries, i.e., textual grounding. In particular, we go a step further by pairing the language embedding of each open-vocabulary key with the visual embedding of its grounded text. We design a document-tailored grounding framework that incorporates layout-aware context learning and document-tailored two-stage pre-training, which significantly improves the model's understanding of documents. Our method outperforms existing solutions on the SVRD benchmark for the open-vocabulary VIE task while offering lower cost and faster inference. Specifically, it infers about 20x faster than the QwenVL model and improves the F-score by about 24.3%.
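To make the grounding formulation concrete, the following is a minimal sketch, not the authors' implementation, of matching an open-vocabulary key's language embedding against the visual embeddings of grounded text regions and reading the value from the best-matching region. The function name and all tensor shapes are illustrative assumptions.

```python
# Minimal sketch (hypothetical, not the paper's code) of open-vocabulary VIE
# cast as textual grounding: a key query ("invoice date", "vendor name", ...)
# is embedded as text, each candidate text region is embedded from its
# visual/layout features, and the value is read from the closest region.

import torch
import torch.nn.functional as F

def ground_key(key_embedding: torch.Tensor,
               region_embeddings: torch.Tensor,
               region_texts: list) -> str:
    """Return the text of the region that best matches the key query.

    key_embedding:      (d,)   language embedding of the open-vocabulary key
    region_embeddings:  (N, d) visual embeddings of grounded text regions
    region_texts:       N OCR'd strings, one per region
    """
    # Cosine similarity between the key and every candidate region.
    sims = F.cosine_similarity(key_embedding.unsqueeze(0),
                               region_embeddings, dim=-1)
    best = int(sims.argmax())
    return region_texts[best]

# Usage with random placeholder embeddings:
d, n = 256, 12
key = torch.randn(d)
regions = torch.randn(n, d)
texts = [f"region_{i}" for i in range(n)]
print(ground_key(key, regions, texts))
```

In this framing, new keys require no retraining of a fixed label set; any key that can be embedded as text can be grounded against the document's regions.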
