Context and main objectives
The digital transformation of libraries, which has relied on OCR (Optical Character Recognition) technology for more than 20 years, faces limitations both in quality, due to the diversity of the collections and the shortcomings of OCR technology, and in added value, due to a lack of structuring and high-level indexing. Named entity extraction is still little used because it relies on language processing technologies that, until recently, were difficult to adapt. More generally, the semantic indexing of collections is underdeveloped and poorly integrated with metadata. We propose to develop multimodal models (text + image) for extracting information from collections of digitized documents in large libraries. The literature shows that work in this direction is still limited and that it is mainly aimed at processing commercial documents (invoices, etc.).
The proposed project aims to disrupt the traditional sequential document processing workflow by combining vision models and Large Language Models (LLMs) into a more streamlined and efficient approach. The standard two-stage architectures based on OCR + NER (Optical Character Recognition followed by Named Entity Recognition) are now giving way to end-to-end multimodal approaches known as Document Understanding, which are more versatile and more easily adaptable to new corpora, making document processing projects easier and more cost-effective to set up and run. As a result, this accessible, user-friendly approach will democratize access to advanced AI technologies for a wider range of institutions, contributing to the evolution of the technology value chain in the Libraries, Archives and Museums (LAM) sector and opening up new opportunities for research and discovery.
The proposed work program, funded by the FINLAM project (Foundation INtegrated models for Libraries, Archives and Museums, ANR 2023), relies on the expertise of LITIS to study the most relevant multimodal architectures for integrating the language knowledge conveyed by recently developed large language models, and to study how these models can be specialized and adapted in conjunction with the training of a generic optical encoder, benefiting from the annotated collections available at the French national library (Bibliothèque nationale de France - BnF). User interaction will be considered according to different scenarios of closed and open queries.
State of the art overview
In 2022, the first end-to-end models integrating OCR and named entity extraction were proposed for document understanding tasks. The DONUT (DOcumeNt Understanding Transformer) model [1], proposed by NAVER and Google researchers, performs in a single stage the analysis of the layout to detect writing areas, recognizes their content using a lexicon of subwords, and finally detects named entities using specific tags and a strong external language model (BART). DONUT is pre-trained on synthetic documents whose associated ground truth is a sequence of subwords and tags; no segmentation ground truth is used. Document Understanding is thus reduced to the task of learning a tagged language, provided that the system has the vision capabilities to build high-level visual representations. A similar approach was proposed by Adobe Research in autumn 2022 with DESSURT [2]. The Pix2Struct architecture [3] proposed by Google also falls into this category of integrated systems for document understanding.
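The tagged-language idea can be illustrated with a short sketch: a DONUT-style model emits a flat token sequence in which field tags delimit the recognized values, and a trivial parser recovers a structured record from it. The tag names and the parser below are illustrative assumptions for this document, not DONUT's actual vocabulary or code.

```python
import re

def parse_tagged_sequence(seq: str) -> dict:
    """Convert a flat, DONUT-style tagged output sequence into a
    structured record. Tag names (<s_title>, <s_date>, ...) are
    illustrative, not the model's actual vocabulary."""
    fields = {}
    # Match <s_field> ... </s_field> pairs, non-greedily, across lines.
    for match in re.finditer(r"<s_(\w+)>(.*?)</s_\w+>", seq, re.S):
        fields[match.group(1)] = match.group(2).strip()
    return fields

# Hypothetical model output for a digitized newspaper front page:
print(parse_tagged_sequence(
    "<s_title> Le Figaro </s_title> <s_date> 12 mai 1923 </s_date>"
))
# → {'title': 'Le Figaro', 'date': '12 mai 1923'}
```

The point of the sketch is that once the decoder learns to emit well-formed tag pairs, no separate NER stage is needed: the structured output falls out of the token sequence itself.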
In the same year, the LITIS Machine Learning team proposed two models for digitized documents that integrate the layout analysis stage. The VAN (Vertical Attention Network) model [4] learns to recognize paragraphs of handwritten text and outperforms the state of the art. The DAN (Document Attention Network) model [5] learns both the layout and the handwriting of a document end-to-end. DAN is trained on synthetic printed documents before being specialized on handwritten documents, without using any physical segmentation information, and outputs the recognized text enriched with layout tags. The DONUT and DAN models rely on the same visual attention mechanisms, implemented by a transformer-type network; both are pre-trained on synthetic documents and use only tagged transcriptions of texts during training. DAN is specialized in text recognition, while DONUT is specialized in named entity extraction.
Orientation of research
1. Geewook Kim, Teakgyu Hong, Moonbin Yim, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, and Seunghyun Park, Donut: OCR-free Document Understanding Transformer, ECCV, pp. 498–517, 2022, http://arxiv.org/abs/2111.15664
2. Brian Davis, Bryan Morse, et al., End-to-end Document Recognition and Understanding with Dessurt, 2022, https://arxiv.org/abs/2203.16618
3. Kenton Lee, Mandar Joshi, et al., Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding, 2022, https://arxiv.org/abs/2210.03347
4. Denis Coquenet, Clément Chatelain, and Thierry Paquet, End-to-end Handwritten Paragraph Text Recognition Using a Vertical Attention Network, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 1, pp. 508–524, Jan. 2023, http://doi.org/10.1109/TPAMI.2022.3144899; pre-print: https://arxiv.org/abs/2012.03868
5. Denis Coquenet, Clément Chatelain, and Thierry Paquet, DAN: a Segmentation-free Document Attention Network for Handwritten Document Recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, http://doi.org/10.1109/TPAMI.2023.3235826; pre-print: https://arxiv.org/abs/2203.12273
6. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, https://arxiv.org/abs/1810.04805
7. Alex Andonian, Quentin Anthony, Stella Biderman, Sid Black, Preetham Gali, Leo Gao, Eric Hallahan, Josh Levy-Kramer, Connor Leahy, Lucas Nestler, Kip Parker, Michael Pieler, Shivanshu Purohit, Tri Songz, Phil Wang, and Samuel Weinbach, GPT-NeoX: Large Scale Autoregressive Language Modeling in PyTorch, 2021
8. Minesh Mathew, Dimosthenis Karatzas, and C. V. Jawahar, DocVQA: A Dataset for VQA on Document Images, WACV 2021, arXiv:2007.00398
9. Minesh Mathew, Ruben Tito, Dimosthenis Karatzas, R. Manmatha, and C. V. Jawahar, Document Visual Question Answering Challenge 2020, DAS 2020, arXiv:2008.08899