Automatic Generation of Semi-structured Documents

Djedjiga Belhadj; Yolande Belaïd; Abdel Belaid

doi:10.1007/978-3-030-86159-9_13

Conference paper, Year: 2021

Djedjiga Belhadj

Role: Author
PersonId: 825383
ORCID: 0000-0003-0548-3948

Department of Natural Language Processing & Knowledge Discovery

Yolande Belaïd

Role: Author
PersonId: 1176
IdHAL: yolande-belaid
IdRef: 056980574

Department of Natural Language Processing & Knowledge Discovery

Abdel Belaid

Role: Author

Department of Natural Language Processing & Knowledge Discovery

Abstract

In this paper, we present a generator of semi-structured documents (SSDs). The generator produces samples of administrative documents that are useful for training information extraction systems. It also handles document annotation, which is generally difficult and time-consuming to do by hand. We propose a general structure for SSDs and show that it works well on three SSD types: invoices, payslips and receipts. Both the content and the layout are controlled by random variables, so they can be varied to produce different samples. These document types are similar enough to share a common global model, with particularities specific to each type. The generator outputs the documents in three formats: PDF, XML and TIFF image. We add an evaluation step to select an adequate dataset for the learning process and to avoid overfitting. The current implementation (https://github.com/fairandsmart/facogen) can easily be extended to other SSD types. We use the generator's output to experiment with a system for extracting information from SSDs.
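
The idea of driving both content and layout with random variables, with annotations produced as a by-product of generation, can be sketched roughly as follows. This is a minimal illustration only, not the facogen implementation: the function name, field names, value ranges, and page coordinates are all hypothetical assumptions chosen for the example.

```python
import random

# Hypothetical page size in points (A4); an assumption, not from the paper.
PAGE_W, PAGE_H = 595, 842


def generate_invoice(seed):
    """Sample one synthetic invoice: content, layout and annotations."""
    rng = random.Random(seed)  # seeded, so each sample is reproducible

    # Content random variables (illustrative ranges; prices in integer cents).
    n_items = rng.randint(1, 5)
    items = [
        {"qty": rng.randint(1, 10), "unit_price": rng.randint(100, 9999)}
        for _ in range(n_items)
    ]
    total = sum(it["qty"] * it["unit_price"] for it in items)
    content = {
        "invoice_number": f"INV-{rng.randint(0, 99999):05d}",
        "total_cents": total,
    }

    # Layout random variables: where each field is placed on the page.
    # Because the generator chooses both text and position, the annotation
    # (label, text, position) is known without any manual labelling.
    annotations = []
    for label, value in content.items():
        x = rng.randint(40, PAGE_W - 200)
        y = rng.randint(40, PAGE_H - 60)
        annotations.append({"label": label, "text": str(value), "x": x, "y": y})

    return {"items": items, "content": content, "annotations": annotations}


sample = generate_invoice(seed=0)
```

Changing the seed yields a different document with different field values and positions, while the same seed reproduces the same sample. Rendering to PDF, XML, or TIFF is deliberately left out of this sketch.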

Keywords

Semi-structured documents; Random variables; Automatic generation

Domains

Human-Computer Interaction [cs.HC]; Document and Text Processing

Contributor: Yolande Belaid

https://inria.hal.science/hal-03647644

Submitted on: Wednesday, April 20, 2022, 16:55:24

Last modified on: Monday, September 11, 2023, 17:41:19

Dates and versions

hal-03647644, version 1 (20-04-2022)

Identifiers

HAL Id: hal-03647644, version 1
DOI: 10.1007/978-3-030-86159-9_13

Cite

Djedjiga Belhadj, Yolande Belaïd, Abdel Belaid. Automatic Generation of Semi-structured Documents. ICDAR Workshop on Open Services and Tools for Document Analysis (ICDAR-OST), Sep 2021, Lausanne, Switzerland. pp.191-205, ⟨10.1007/978-3-030-86159-9_13⟩. ⟨hal-03647644⟩

Collections

CNRS INRIA UNIV-LORRAINE LORIA LORIA-NLPKD
