Conference paper, Year: 2021

Automatic Generation of Semi-structured Documents

Abstract

In this paper, we present a generator of semi-structured documents (SSDs). The generator produces samples of administrative documents that can be used to train information extraction systems. It also handles document annotation, which is generally difficult and time-consuming. We propose a general structure for SSDs and show that it works on three SSD types: invoices, payslips and receipts. Both the content and the layout are controlled by random variables, so that they can be varied to obtain different samples. These document types share enough similarity to follow a common global model, with particularities specific to each. The generator outputs the documents in three formats: PDF, XML and TIFF image. We add an evaluation step to select an adequate dataset for the learning process and avoid overfitting. The current implementation (https://github.com/fairandsmart/facogen) can easily be extended to other SSD types. We use the generator's output to experiment with a system for extracting information from SSDs.
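The idea of driving both content and layout with random variables, and getting the annotation for free, can be illustrated with a minimal Python sketch. It is not the facogen implementation or its API: the value pools, the layout variants, and the functions `generate_invoice` and `to_annotation_xml` are all hypothetical names introduced here for illustration, and rendering to PDF or TIFF is left out.

```python
import random
import xml.etree.ElementTree as ET

# Hypothetical value pools; a real generator would draw from much larger ones.
VENDORS = ["Acme SARL", "Globex SA", "Initech SAS"]
PRODUCTS = [("Paper A4", 4.50), ("Toner", 39.90), ("Stapler", 12.00)]
# Hypothetical layout variants: (x, y) page position of each block, in points.
LAYOUTS = [
    {"vendor": (40, 40), "items": (40, 200), "total": (400, 600)},
    {"vendor": (350, 40), "items": (40, 250), "total": (40, 650)},
]


def generate_invoice(seed):
    """Sample one invoice; content and layout both come from random draws."""
    rng = random.Random(seed)  # seeded so that each sample is reproducible
    items = []
    for name, unit_price in rng.sample(PRODUCTS, k=rng.randint(1, len(PRODUCTS))):
        qty = rng.randint(1, 5)
        items.append({"name": name, "qty": qty, "amount": round(qty * unit_price, 2)})
    return {
        "vendor": rng.choice(VENDORS),
        "items": items,
        "total": round(sum(i["amount"] for i in items), 2),
        "layout": rng.choice(LAYOUTS),
    }


def to_annotation_xml(invoice):
    """Serialize the ground truth: each field with its label, value and position."""
    root = ET.Element("document", type="invoice")
    for label in ("vendor", "total"):
        x, y = invoice["layout"][label]
        field = ET.SubElement(root, "field", label=label, x=str(x), y=str(y))
        field.text = str(invoice[label])
    return ET.tostring(root, encoding="unicode")


sample = generate_invoice(seed=7)
annotation = to_annotation_xml(sample)
```

Because the generator knows every value it placed and where it placed it, the annotation is a by-product of generation rather than a separate manual step. Varying the seed yields different samples, while reusing a seed reproduces the same one.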
No file deposited

Dates and versions

hal-03647644, version 1 (20-04-2022)

Cite

Djedjiga Belhadj, Yolande Belaïd, Abdel Belaid. Automatic Generation of Semi-structured Documents. ICDAR Workshop on Open Services and Tools for Document Analysis (ICDAR-OST), Sep 2021, Lausanne, Switzerland. pp.191-205, ⟨10.1007/978-3-030-86159-9_13⟩. ⟨hal-03647644⟩