INRIA - Institut National de Recherche en Informatique et en Automatique
Conference paper. Year: 2019

Functional Annotation of Proteins using Domain Embedding based Sequence Classification

Abstract

Owing to recent advances in genomic sequencing technologies, the number of protein sequences in public databases is growing exponentially. The UniProt Knowledgebase (UniProtKB) is currently the largest and most comprehensive resource for protein sequence and annotation data; its May 2019 release contains around 158 million protein sequences. To fully exploit this huge knowledge base, protein sequences need to be annotated with functional properties such as Enzyme Commission (EC) numbers and Gene Ontology terms. However, only about half a million sequences (UniProtKB/Swiss-Prot) have been reviewed and functionally annotated by expert curators using information extracted from the published literature and computational analyses. Manual annotation by experts is expensive and slow, and it is insufficient to close the gap between annotated and unannotated protein sequences. In this paper, we present an automatic functional annotation technique using neural network based word embeddings that exploit the domain and family information of proteins. Domains are the most conserved regions in protein sequences and constitute the building blocks of 3D protein structures. For the experiments, we used fastText (https://github.com/facebookresearch/fasttext), a library for learning word embeddings and text classification developed by Facebook's AI Research lab. The experimental results show that domain embeddings perform much better than k-mer based word embeddings.
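To make the contrast between domain "words" and k-mer "words" concrete, the following minimal sketch shows how a single protein might be turned into a fastText training line under each representation. It is an illustration only, not the authors' pipeline: the InterPro-style domain identifiers, the EC label, the toy sequence, the choice of k = 3, and the helper names `kmer_tokens` and `fasttext_line` are all assumptions made for this example. The only fastText convention relied on is the `__label__` prefix used by its supervised text format.

```python
from typing import Iterable, List


def kmer_tokens(sequence: str, k: int = 3) -> List[str]:
    """Split an amino-acid sequence into overlapping k-mers (baseline 'words')."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def fasttext_line(labels: Iterable[str], tokens: Iterable[str]) -> str:
    """Format one example as '__label__X ... token token ...' for fastText."""
    label_part = " ".join(f"__label__{label}" for label in labels)
    return f"{label_part} {' '.join(tokens)}"


# Hypothetical protein: the domain IDs, EC number, and sequence below are
# illustrative placeholders, not data taken from the paper.
example_domains = ["IPR000719", "IPR011009"]
example_labels = ["2.7.11.1"]
example_sequence = "MKVLAAGIV"

# Domain representation: each domain identifier is one 'word'.
domain_line = fasttext_line(example_labels, example_domains)

# k-mer representation: each overlapping 3-mer is one 'word'.
kmer_line = fasttext_line(example_labels, kmer_tokens(example_sequence, k=3))

print(domain_line)
print(kmer_line)

# Lines like these, written one per protein to a text file, could then be
# given to fastText's supervised trainer, for instance
# fasttext.train_supervised(input="train.txt") in the Python bindings.
```

The domain line is much shorter than the k-mer line for the same protein, since a handful of domain identifiers stands in for the whole sequence. This illustrates only the difference in input representation and does not reproduce the reported results.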
Main file
KDIR_Camera_Ready.pdf (177.67 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-02283430, version 1 (10-09-2019)

Identifiers

HAL Id: hal-02283430
DOI: 10.5220/0008353401630170

Cite

Bishnu Sarker, David W. Ritchie, Sabeur Aridhi. Functional Annotation of Proteins using Domain Embedding based Sequence Classification. KDIR 2019 - 11th International Conference on Knowledge Discovery and Information Retrieval, Sep 2019, Vienna, Austria. pp.163-170, ⟨10.5220/0008353401630170⟩. ⟨hal-02283430⟩
295 views
375 downloads
