JPMorgan Chase Launches DocLLM for Multimodal Document Understanding

2024-01-04

JPMorgan Chase has launched DocLLM, a generative language model designed for multimodal document understanding. DocLLM is a lightweight extension of large language models (LLMs) that stands out for its ability to analyze enterprise documents such as tables, invoices, reports, and contracts, whose rich semantics lie at the intersection of textual and spatial patterns.

Unlike existing multimodal LLMs, DocLLM strategically avoids expensive image encoders and focuses on incorporating spatial layout structures using bounding box information. The model introduces a decoupled spatial attention mechanism by decomposing the attention mechanism in classical transformers into a set of independent matrices.
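For intuition, the sketch below shows what such a decoupled attention score could look like in PyTorch: text embeddings and bounding-box embeddings get separate query/key projections, and their text-text, text-spatial, spatial-text, and spatial-spatial interaction terms are combined with weighting factors. The module name, the `lambdas` hyperparameters, and the projection setup are illustrative assumptions, not the released DocLLM implementation.

```python
import torch
import torch.nn as nn


class DisentangledSpatialAttention(nn.Module):
    """Minimal sketch of attention decomposed into text and spatial terms."""

    def __init__(self, d_model, lambdas=(1.0, 1.0, 1.0)):
        super().__init__()
        # Separate projections for text embeddings and bounding-box embeddings
        self.q_t = nn.Linear(d_model, d_model)
        self.k_t = nn.Linear(d_model, d_model)
        self.v_t = nn.Linear(d_model, d_model)
        self.q_s = nn.Linear(d_model, d_model)
        self.k_s = nn.Linear(d_model, d_model)
        self.lambdas = lambdas  # weights for the cross and spatial terms

    def forward(self, text_emb, box_emb):
        # text_emb, box_emb: (batch, seq_len, d_model)
        qt, kt, vt = self.q_t(text_emb), self.k_t(text_emb), self.v_t(text_emb)
        qs, ks = self.q_s(box_emb), self.k_s(box_emb)
        scale = qt.size(-1) ** 0.5
        l_ts, l_st, l_ss = self.lambdas
        # Attention score = text-text + text-spatial + spatial-text + spatial-spatial
        scores = (qt @ kt.transpose(-2, -1)
                  + l_ts * (qt @ ks.transpose(-2, -1))
                  + l_st * (qs @ kt.transpose(-2, -1))
                  + l_ss * (qs @ ks.transpose(-2, -1))) / scale
        weights = torch.softmax(scores, dim=-1)
        return weights @ vt
```

Keeping the spatial terms as separate projections, rather than encoding page images, is what keeps the extension lightweight: only the bounding boxes of OCR tokens are needed as extra input.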

DocLLM tackles the challenges of irregular layouts and heterogeneous content in visual documents by adopting a pretraining objective that learns to infill text segments rather than predict only the next token.
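Conceptually, segment infilling means hiding a contiguous block of tokens and asking the model to regenerate it from the surrounding context (plus, in DocLLM's case, the layout cues). The following sketch illustrates the idea; the function name and the span-selection policy are illustrative assumptions, not the paper's exact recipe.

```python
import random


def make_infilling_example(tokens, mask_token="[MASK]", span_len=3):
    """Hide a contiguous block of tokens and return (corrupted input, target block)."""
    start = random.randrange(0, max(1, len(tokens) - span_len))
    target = tokens[start:start + span_len]                      # block to predict
    corrupted = tokens[:start] + [mask_token] + tokens[start + span_len:]
    return corrupted, target


# Example: the model sees the corrupted sequence and must generate the missing block.
corrupted, target = make_infilling_example(
    ["Invoice", "No.", "1042", "Total", "Due", ":", "$", "980.00"])
print(corrupted, "->", target)
```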

Together, the decoupled spatial attention mechanism facilitates cross-alignment between the text and layout modalities, while the infilling pretraining objective lets the model cope with irregular layouts.

To pretrain DocLLM, data was collected from two main sources: the IIT-CDIP Test Collection 1.0 and DocBank. The former contains over 5 million documents related to legal litigation against the tobacco industry in the 1990s, while the latter consists of 500,000 documents with diverse layouts.

Extensive evaluations on various document intelligence tasks have demonstrated that DocLLM outperforms existing state-of-the-art LLMs. The model surpasses equivalent models on 14 out of 16 known datasets and exhibits strong generalization capabilities on 4 out of 5 unseen datasets.

Looking ahead, JPMorgan Chase has expressed its commitment to further enhancing the capabilities of DocLLM by incorporating visual elements in a lightweight manner.