Conference paper. Year: 2012

Textual summarisation of flowcharts in patent drawings for CLEF-IP 2012

Abstract

The CLEF-IP 2012 track included the Flowchart Recognition task, an image-based task where the goal was to process binary images of flowcharts taken from patent drawings to produce summaries containing information about their structure. The textual summaries include information about the flowchart title, the box-node shapes, the connecting edge types, text describing flowchart content and the structural relationships between nodes and edges.

An algorithm designed for this task and characterised by the following method steps is presented:

* Text-graphic segmentation based on connected-component clustering;
* Line segment bridging with an adaptive, oriented filter;
* Box shape classification using a stretch-invariant transform to extract features based on shape-specific symmetry;
* Text object recognition using a noisy channel model to enhance the results of a commercial OCR package.

Performance evaluation results for the CLEF-IP 2012 Flowchart Recognition task are not yet available so the performance of the algorithm has been measured by comparing algorithm output with object-level ground-truth values. An average F-score was calculated by combining node classification and edge detection (ignoring edge directivity). Using this measure, a third of all drawings were recognized without error (average F-score=1.00) and 75% show an F-score of 0.78 or better. The most important failure modes of the algorithm have been identified as text-graphic segmentation, line-segment bridging and edge directivity classification.

The text object recognition module of the algorithm has been independently evaluated. Two different state-of-the-art OCR software packages were compared and a post-correction method was applied to their output. Post-correction yields an improvement of 9% in OCR accuracy and a 26% reduction in the word error rate.
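The object-level measure mentioned in the abstract (an average F-score combining node classification and edge detection, ignoring edge directivity) could be computed along the following lines. This is only a minimal sketch under stated assumptions: the abstract does not say how the two scores are combined, so an unweighted mean is assumed here, and the function names, node identifiers and shape labels are hypothetical illustrations, not taken from the paper.

```python
from typing import Dict, Hashable, Iterable, Set, Tuple


def f_score(predicted: Set[Hashable], truth: Set[Hashable]) -> float:
    """Harmonic mean of precision and recall over two sets of items."""
    if not predicted and not truth:
        return 1.0  # nothing to find and nothing reported
    true_pos = len(predicted & truth)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(truth)
    return 2 * precision * recall / (precision + recall)


def undirected(edges: Iterable[Tuple[str, str]]) -> Set[frozenset]:
    """Drop edge direction: (a, b) and (b, a) become the same item."""
    return {frozenset(edge) for edge in edges}


def drawing_score(
    pred_nodes: Dict[str, str],
    true_nodes: Dict[str, str],
    pred_edges: Iterable[Tuple[str, str]],
    true_edges: Iterable[Tuple[str, str]],
) -> float:
    """Average of node and edge F-scores for one drawing.

    Assumption: an unweighted mean of the two F-scores. A node counts as
    correct only if both its identifier and its shape label match.
    """
    node_f = f_score(set(pred_nodes.items()), set(true_nodes.items()))
    edge_f = f_score(undirected(pred_edges), undirected(true_edges))
    return (node_f + edge_f) / 2


# Hypothetical drawing: one node shape is misclassified and one edge is
# detected with the wrong direction (which this measure ignores).
true_nodes = {"n1": "rectangle", "n2": "diamond", "n3": "oval"}
pred_nodes = {"n1": "rectangle", "n2": "diamond", "n3": "rectangle"}
true_edges = [("n1", "n2"), ("n2", "n3")]
pred_edges = [("n2", "n1"), ("n2", "n3")]

score = drawing_score(pred_nodes, true_nodes, pred_edges, true_edges)
print(round(score, 3))
```

In this hypothetical drawing the reversed edge does not lower the score, because direction is discarded before comparison; only the misclassified node shape does.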
Main file
clef-ip-2012-flow.pdf (1.1 MB)
Origin: Files produced by the author(s)

Dates and versions

hal-00728779, version 1 (06-09-2012)

Identifiers

  • HAL Id: hal-00728779, version 1

Cite

Andrew Thean, Jean-Marc Deltorn, Patrice Lopez, Laurent Romary. Textual summarisation of flowcharts in patent drawings for CLEF-IP 2012. CLEF 2012, Sep 2012, Roma, Italy. ⟨hal-00728779⟩

Collections

INRIA INRIA2
359 Views
526 Downloads
