Research Engineer: Information extraction, Text Recognition in Historical Document Collections

Context

LITIS

LITIS (Laboratoire d’Informatique, Traitement de l’information et des Systèmes) is a research laboratory associated to the University of Rouen Normandie, Le Havre Normandie Normandie, and School of Engineering INSA Rouen Normandie. Research at LITIS is organized around 7 research teams which contribute to 3 main application domains: Access to Information, Biomedical Information Processing, Ambient Intelligence. LITIS currently includes 90 faculty staff members, 50 PhD students, 20 PostDoc and Research Engineers. The Machine Learning team of LITIS is developing research in modeling unstructured data (signals, images, text, etc...) with machine learning algorithms and statistical models. For more than two decades it has contributed to the development of reading systems and document image analysis for various applications such as postal automation, business document exchange, digital libraries, etc...

The EXO-POPP project

Optical Extraction of Handwritten Named Entities for Marriage Certificates for the Population of Paris (1880–1940)

Thanks to a collaboration between specialists in machine learning and historians, the EXO-POPP project will develop a database of 300,000 marriage certificates from Paris and its suburbs between 1880 and 1940. These marriage certificates provide a wealth of information about the bride and groom, their parents, and their marriage witnesses, that will be analyzed from a host of new angles made possible by the new dataset. These studies of marriage, divorce, kinship, and social networks covering a span 60 years will also intersect with transversal issues such as gender, class, and origin. The geolocation of data will provide a rare opportunity to work on places and relocations within the city, and linkage with two other databases will make it possible to follow people from birth to death.

Building such a database by hand would take at least 50,000 hours of work. But, thanks to the recent developments in deep learning and machine learning, it is now possible to build huge databases with automated reading systems including handwriting recognition and natural language understanding. Indeed, because of these recent advances, optical printed named entity recognition (OP-NER) is now performing very well. On the other hand, while handwriting recognition by machine has become a reality, also thanks to deep learning, optical handwritten named entity recognition (OH-NER) has not received much attention. OH-NER is expected to achieve promising results on handwritten marriage certificates dating from 1880 to 1923. This project research questions will focus on the best strategies for word disambiguation for handwritten named entity recognition. We will explore end-to-end deep learning architectures for OH-NER, writer adaptation of the recognition system, and named entity disambiguation by exploiting the French mortality database (INSEE) and the French POPP database. An additional benefit of this study is that a unique and very large dataset of handwritten material for named entity recognition will be built.

Description

Missions

The research engineer will be in charge of the development of a processing pipeline dedicated to optical printed named entity recognition (OP-NER). He will closely collaborate with a Ph.D. student in charge of Handwritten Named Entity Recognition (OH-NER).

OP-NER is the project’s easiest task and will benefit from the latest results achieved by the LITIS team on similar problems on financial yearbooks. Images are first processed to extract every text information. This will be achieved with the DAN architecture designed by LITIS which is a deep- learning-based OCR (https://arxiv.org/abs/2203.12273). The research engineer will be in charge of this OCR task. A benchmark of DAN against available OCR software such as Tesseract and EasyOCR will also be conducted. Then the textual transcriptions will be processed for named entity extraction and recognition. Named entity recognition is a well-defined task in the natural language processing community. In the EXO-POPP context however, we need to define each entity to be extracted more precisely to make a clear distinction between the different people occurring in the text. For example, we need to distinguish between wife and husband names, and similarly for the parents of the husband and of the wife, and so on for the witnesses, children, etc. An estimation of around 135 categories has been established. The TAG definition was made by LITIS as well as a first training dataset. Manually tagging the transcriptions has been made possible through the PIVAN web-based collaborative interface (https://litis-exopopp.univ-rouen.fr/collection/12 ). This platform provides in one single web interface a document image viewer, viewing and editing of OCR results and text tagging facilities for NER. PIVAN eases the annotation efforts of the H&SS trainees and allows for building the large, annotated datasets required for machine learning algorithms to run optimally. The research engineer will oversee datasets generation and curation as per the requirement of the EXO-POPP NER task, including the handwritten datasets.

The named entity recognition task will be based on a state-of-the-art machine learning approach. We have started some experimentations with the well-known FLAIR NER library (https://github.com/flairNLP/flair). We plan to continue developing and tuning the EXO-POPP named entity recognition module using this library. The research engineer will oversee this task entirely.

Tasks

Tuning PIVAN for the OCR task. A benchmark of DAN with the available OCR technologies such as Tesseract and EasyOCR will also be conducted.
Datasets generation and curation as per the requirement of the EXO-POPP NER task, including the handwritten datasets.
NER module development

Deliverables:

Transcription of the typescript corpus
Named entities extracted from the typescript corpus

Skills:

General software development and engineering, Python
Machine Learning, Computer vision, Natural Language Processing
Ability to work in a team, curious and rigorous spirit
Knowledge in web-based programming is a plus

Fiche

Fiche de poste

How to apply ?

Research Engineer

Time commitment: Full-time

Duration of the contract: September 1st 2022 – 31st August 2023

Contact: Prof. Thierry Paquet, Thierry.Paquet@univ-rouen.fr

Indicative salary: €24 000 annual net salary, plus French social security benefits

Location: LITIS, Campus du Madrillet, Faculty of science, Saint Etienne du Rouvray, France