Advancing Domain-Specific AI through Document Understanding Techniques
We investigate advanced document understanding techniques that integrate structural parsing, semantic representation, and multimodal reasoning. Our approach focuses on extracting fine-grained layout elements such as subtitles, text blocks, tables, figures, and captions by assigning region-level roles and spatial metadata (e.g., bounding boxes and page indices).
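As a concrete illustration of region-level roles and spatial metadata, the sketch below defines a minimal layout-region record. The field names (`role`, `page`, `bbox`, `content`) are assumptions chosen for clarity, not an API from the text:

```python
from dataclasses import dataclass

# Hypothetical region record; field names are illustrative assumptions.
@dataclass
class Region:
    role: str            # e.g. "subtitle", "text", "table", "figure", "caption"
    page: int            # zero-based page index
    bbox: tuple[float, float, float, float]  # (x0, y0, x1, y1) in page coordinates
    content: str = ""    # extracted text, if any

regions = [
    Region("subtitle", 0, (72.0, 90.0, 520.0, 110.0), "1. Introduction"),
    Region("figure",   0, (72.0, 300.0, 300.0, 480.0)),
    Region("caption",  0, (72.0, 485.0, 300.0, 505.0), "Figure 1: pipeline overview"),
]

# Group regions by page for downstream layout reasoning.
by_page: dict[int, list[Region]] = {}
for r in regions:
    by_page.setdefault(r.page, []).append(r)
```

With roles and bounding boxes attached to every region, later stages can reason about spatial adjacency (e.g. a caption directly below a figure) rather than raw text order.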
We model documents as structured graphs, where hierarchical and relational dependencies (e.g., contains, describes, has_heading) are explicitly encoded to capture the logical organization of content. Building upon this representation, we construct evidence subgraphs that connect semantically related regions across text, tables, and figures, enabling coherent cross-modal reasoning.
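The graph representation and evidence-subgraph construction can be sketched with a plain adjacency list and a bounded traversal. The node IDs, edge list, and hop-limited expansion policy are illustrative assumptions; the relation labels (`contains`, `describes`, `has_heading`) come from the text:

```python
from collections import deque

# Illustrative document graph: nodes are region IDs, edges carry a relation label.
edges = [
    ("sec1", "contains", "para1"),
    ("sec1", "has_heading", "subtitle1"),
    ("sec1", "contains", "fig1"),
    ("cap1", "describes", "fig1"),
    ("sec1", "contains", "cap1"),
]

adj: dict[str, list[tuple[str, str]]] = {}
for src, rel, dst in edges:
    adj.setdefault(src, []).append((rel, dst))
    adj.setdefault(dst, []).append((rel, src))  # relations traversable in both directions

def evidence_subgraph(seeds: list[str], hops: int) -> set[str]:
    """Collect all nodes within `hops` relation steps of the seed regions."""
    seen, frontier = set(seeds), deque((s, 0) for s in seeds)
    while frontier:
        node, depth = frontier.popleft()
        if depth == hops:
            continue
        for _, nxt in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return seen

# A query matching the figure pulls in its caption and enclosing section.
sub = evidence_subgraph(["fig1"], hops=1)
```

One hop from `fig1` already links the figure to its describing caption and containing section, which is the cross-modal connectivity the evidence subgraph is meant to expose.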
In parallel, we develop region-aware chunking strategies that complement conventional vector-based text chunking by preserving layout and contextual boundaries. This enables more accurate retrieval and reasoning in downstream tasks such as retrieval-augmented generation (RAG) and multimodal question answering.
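A minimal sketch of one such strategy, under the assumption that chunks are packed from whole regions and never cut across a layout boundary (the packing policy and character budget are illustrative, not the paper's exact method):

```python
# Hypothetical region-aware chunker: regions are packed whole into chunks,
# so a chunk never splits a layout region across a boundary.
def chunk_regions(regions: list[tuple[str, str]],
                  max_chars: int) -> list[list[tuple[str, str]]]:
    chunks, current, size = [], [], 0
    for role, text in regions:
        # Start a new chunk if adding this whole region would exceed the budget.
        if current and size + len(text) > max_chars:
            chunks.append(current)
            current, size = [], 0
        current.append((role, text))
        size += len(text)
    if current:
        chunks.append(current)
    return chunks

regions = [
    ("subtitle", "2. Method"),
    ("text", "We parse each page into typed regions."),
    ("caption", "Figure 2: evidence subgraph."),
]
chunks = chunk_regions(regions, max_chars=60)
```

Unlike fixed-size character windows, each chunk here is a list of intact (role, text) regions, so a caption or table cell is never truncated mid-region at retrieval time.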
Furthermore, we incorporate large language models to interpret structured document graphs and generate human-readable explanations, while multi-agent architectures decompose complex document reasoning into modular subtasks.
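The two ideas above can be sketched together: linearizing a document graph into text an LLM can consume, and routing a question to a specialist subtask handler. The agent names, routing keywords, and stub handlers (plain functions standing in for LLM-backed agents) are all illustrative assumptions:

```python
# Linearize graph edges into a textual context an LLM could interpret.
def graph_to_prompt(edges: list[tuple[str, str, str]]) -> str:
    lines = [f"({src}) -[{rel}]-> ({dst})" for src, rel, dst in edges]
    return "Document graph:\n" + "\n".join(lines)

# Stub "agents": in a real system each would wrap an LLM call; here they are
# plain functions so the routing logic is runnable on its own.
AGENTS = {
    "table": lambda q, ctx: f"[table-agent] answering: {q}",
    "figure": lambda q, ctx: f"[figure-agent] answering: {q}",
    "text": lambda q, ctx: f"[text-agent] answering: {q}",
}

def route(question: str, ctx: str) -> str:
    """Dispatch a question to a modular subtask agent by simple keyword match."""
    for key, agent in AGENTS.items():
        if key in question.lower():
            return agent(question, ctx)
    return AGENTS["text"](question, ctx)  # default fallback

ctx = graph_to_prompt([("cap1", "describes", "fig1")])
answer = route("What does the figure show?", ctx)
```

A production router would use an LLM or classifier rather than keyword matching; the point of the sketch is the decomposition of document reasoning into modular, independently testable subtasks.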
Through this framework, we aim to advance document intelligence systems capable of understanding not only textual content but also structural and visual semantics, supporting reliable knowledge extraction from complex scientific and technical documents.