Automatic Generation of Semi-structured Documents
Abstract
In this paper, we present a generator of semi-structured documents (SSDs). The generator produces samples of administrative documents that are useful for training information extraction systems. It also handles document annotation, which is generally difficult and time-consuming to do by hand. We propose a general structure for SSDs and show that it applies to three SSD types: invoices, payslips and receipts. These document types are similar enough to share a common global model, with particularities specific to each type. Both the content and the layout are driven by random variables, so that they vary from one sample to the next. The generator outputs each document in three formats: PDF, XML and TIFF image. We add an evaluation step to select an adequate dataset for the learning process and to avoid overfitting. The current implementation (https://github.com/fairandsmart/facogen) can easily be extended to other SSD types. We use the output of the generator to experiment with an information extraction system for SSDs.
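To illustrate the idea of content and layout driven by random variables, with the annotation obtained as a by-product of generation, the following minimal Python sketch may help. It is not taken from the implementation linked above: the function names (`generate_sample`, `to_annotation_xml`), the field names, the layout options and the XML schema are all assumptions made for this example, and rendering to PDF or TIFF is left out.

```python
import random
import xml.etree.ElementTree as ET

# Hypothetical layout options; not taken from the actual generator.
LOGO_POSITIONS = ["left", "center", "right"]


def generate_sample(seed: int) -> dict:
    """Draw one synthetic invoice; every field and range is illustrative."""
    rng = random.Random(seed)  # seeded, so the same sample can be regenerated

    # Content variables: a random number of product lines with random values.
    lines = [
        {"quantity": rng.randint(1, 5), "unit_price_cents": rng.randint(100, 9999)}
        for _ in range(rng.randint(1, 4))
    ]
    total = sum(line["quantity"] * line["unit_price_cents"] for line in lines)

    return {
        "type": "invoice",
        # Layout variables: where blocks are placed on the page.
        "layout": {
            "logo": rng.choice(LOGO_POSITIONS),
            "table_first": rng.random() < 0.5,
        },
        "fields": {
            "invoice_number": f"INV-{rng.randint(0, 99999):05d}",
            "total_cents": total,
        },
        "lines": lines,
    }


def to_annotation_xml(sample: dict) -> str:
    """Serialize the known field values as a ground-truth annotation."""
    root = ET.Element("document", type=sample["type"])
    for name, value in sample["fields"].items():
        field = ET.SubElement(root, "field", name=name)
        field.text = str(value)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = generate_sample(seed=42)
    print(sample["layout"])
    print(to_annotation_xml(sample))
```

Because the generator itself chooses every value, the annotation is exact by construction and no manual labeling is needed; varying the seed varies both the content and the layout of the sample.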