Transfer learning for pharmacogenomic relation extraction from text

Walid Hafiane

Abstract

Extracting relations between named entities is a key task in text mining, particularly in the biomedical domain, where it makes it possible to synthesize the knowledge scattered across a large number of publications. Deep learning approaches have recently improved relation extraction performance considerably. However, the complexity of biomedical data remains a major challenge for these approaches, and current architectures do not fully address the extraction of specific relation types such as pharmacogenomic and protein-molecule relations. In this work, we propose architectures that significantly improve performance on this task. In addition, large amounts of annotated biomedical data are difficult and expensive to obtain; to compensate for this lack of data, we use transfer learning. We consider two transfer learning strategies: frozen and fine-tuning. Our BERT-CNN-Segmentation architecture with the fine-tuning strategy achieves new state-of-the-art results on two benchmark biomedical corpora, with an absolute improvement of 32.77% in macro-averaged F-score (F-macro) on PGxCorpus and 1.73% in micro-averaged F-score (F-micro) on ChemProt. These results show the usefulness of transfer learning and show that the performance of the BERT transformer can be improved by exploiting the local latent information in its representation vectors and reinforcing it with the structural information obtained from sentence segmentation.
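The two scores reported above, F-macro and F-micro, weight classes differently, which matters on corpora with imbalanced relation types. The following is a minimal sketch of the generic definitions for single-label classification; the function name `f_scores` and the toy labels are hypothetical, and the exact evaluation protocol used for PGxCorpus and ChemProt (for example, which classes are included in the average) is not specified here.

```python
from collections import Counter


def f_scores(gold, pred):
    """Return (micro_f1, macro_f1) for single-label multi-class predictions.

    Generic definitions only; this is an illustrative helper, not the
    evaluation script used for the reported results.
    """
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[g] += 1  # gold class gains a false negative

    def f1(t, p, n):
        denom = 2 * t + p + n
        return 2 * t / denom if denom else 0.0

    # Micro: pool the counts over all classes, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    return micro, macro


# Toy, made-up labels: one frequent relation type and two rare ones.
gold = ["a", "a", "a", "a", "b", "c"]
pred = ["a", "a", "a", "a", "c", "c"]
micro, macro = f_scores(gold, pred)
print(f"F-micro = {micro:.4f}, F-macro = {macro:.4f}")
```

In this toy case the rare class that is never predicted pulls F-macro well below F-micro, because the macro average gives every class equal weight regardless of its frequency.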
