Textual summarisation of flowcharts in patent drawings for CLEF-IP 2012

Andrew Thean; Jean-Marc Deltorn; Patrice Lopez; Laurent Romary

Conference paper, Year: 2012

Andrew Thean

Role: Author

Institut für Deutsche Sprache und Linguistik

Inria Saclay - Ile de France

Jean-Marc Deltorn

Role: Author

Institut für Deutsche Sprache und Linguistik

Inria Saclay - Ile de France

Patrice Lopez

Role: Author
PersonId : 2984
IdHAL : patricelopez
ORCID : 0000-0002-9959-9441
IdRef : 157929930

Institut für Deutsche Sprache und Linguistik

Inria Saclay - Ile de France

Laurent Romary

Role: Author
PersonId : 307
IdHAL : laurentromary
ORCID : 0000-0002-0756-0508
IdRef : 060702494

Institut für Deutsche Sprache und Linguistik

Inria Saclay - Ile de France

Abstract

The CLEF-IP 2012 track included the Flowchart Recognition task, an image-based task where the goal was to process binary images of flowcharts taken from patent drawings to produce summaries containing information about their structure. The textual summaries include information about the flowchart title, the box-node shapes, the connecting edge types, text describing flowchart content and the structural relationships between nodes and edges. An algorithm designed for this task and characterised by the following method steps is presented:

* Text-graphic segmentation based on connected-component clustering;
* Line segment bridging with an adaptive, oriented filter;
* Box shape classification using a stretch-invariant transform to extract features based on shape-specific symmetry;
* Text object recognition using a noisy channel model to enhance the results of a commercial OCR package.

Performance evaluation results for the CLEF-IP 2012 Flowchart Recognition task are not yet available so the performance of the algorithm has been measured by comparing algorithm output with object-level ground-truth values. An average F-score was calculated by combining node classification and edge detection (ignoring edge directivity). Using this measure, a third of all drawings were recognized without error (average F-score=1.00) and 75% show an F-score of 0.78 or better. The most important failure modes of the algorithm have been identified as text-graphic segmentation, line-segment bridging and edge directivity classification.

The text object recognition module of the algorithm has been independently evaluated. Two different state-of-the-art OCR software packages were compared and a post-correction method was applied to their output. Post-correction yields an improvement of 9% in OCR accuracy and a 26% reduction in the word error rate.
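The evaluation measure mentioned in the abstract (an average F-score combining node classification and edge detection, ignoring edge directivity) could be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not say how the two scores are combined, so an unweighted mean is assumed here, and the function names, the `(node id, shape)` node representation and the node-pair edge representation are all hypothetical.

```python
from typing import Hashable, Iterable, Set, Tuple


def f_score(predicted: Set[Hashable], truth: Set[Hashable]) -> float:
    """Harmonic mean of precision and recall over two sets of objects."""
    if not predicted and not truth:
        return 1.0  # nothing to find and nothing reported
    true_positives = len(predicted & truth)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)


def undirected(edges: Iterable[Tuple[str, str]]) -> Set[frozenset]:
    """Drop edge direction so that (a, b) and (b, a) compare equal."""
    return {frozenset(edge) for edge in edges}


def drawing_score(pred_nodes, true_nodes, pred_edges, true_edges) -> float:
    """ASSUMPTION: unweighted mean of the node and edge F-scores."""
    node_f = f_score(set(pred_nodes), set(true_nodes))
    edge_f = f_score(undirected(pred_edges), undirected(true_edges))
    return (node_f + edge_f) / 2


# Hypothetical drawing: one box shape misclassified, one edge reversed.
true_nodes = {("n1", "rectangle"), ("n2", "diamond"), ("n3", "rectangle")}
pred_nodes = {("n1", "rectangle"), ("n2", "rectangle"), ("n3", "rectangle")}
true_edges = [("n1", "n2"), ("n2", "n3")]
pred_edges = [("n2", "n1"), ("n2", "n3")]
score = drawing_score(pred_nodes, true_nodes, pred_edges, true_edges)
```

In this made-up example the reversed edge is not penalised because direction is ignored, so only the misclassified box shape lowers the score.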

Domains

Computation and Language [cs.CL]

Main file

clef-ip-2012-flow.pdf (1.1 MB)

Origin: Files produced by the author(s)

Contributor: Laurent Romary

https://inria.hal.science/hal-00728779

Submitted on: Thursday, 6 September 2012, 15:55:11

Last modified on: Friday, 22 December 2023, 16:00:04

Long-term archiving on: Friday, 16 December 2016, 11:11:04

Dates and versions

hal-00728779, version 1 (06-09-2012)

Identifiers

HAL Id: hal-00728779, version 1

Cite

Andrew Thean, Jean-Marc Deltorn, Patrice Lopez, Laurent Romary. Textual summarisation of flowcharts in patent drawings for CLEF-IP 2012. CLEF 2012, Sep 2012, Roma, Italy. ⟨hal-00728779⟩

Collections

INRIA INRIA2
