Lossless filter for multiple repetitions with Hamming distance

Pierre Peterlongo; Nadia Pisanti; Frédéric Boyer; Alair Pereira Do Lago; Marie-France Sagot

doi:10.1016/j.jda.2007.03.003

Journal article, Journal of Discrete Algorithms, Year: 2008

Pierre Peterlongo

Role: Corresponding author
PersonId : 171998
IdHAL : pierre-peterlongo
ORCID : 0000-0003-0776-6407
IdRef : 12482062X

Biological systems and models, bioinformatics and sequences

Nadia Pisanti

Role: Author
PersonId : 843474

Department of Computer Science [Pisa]

Frédéric Boyer

Role: Author
PersonId : 843475

Computer science and genomics

Alair Pereira Do Lago

Role: Author

Departamento de Ciência da Computação [São Paulo]

Marie-France Sagot

Role: Author
PersonId : 170068
IdHAL : marie-france-sagot
IdRef : 103537562

Computer science and genomics

Abstract

Similarity search in texts, notably in biological sequences, has received substantial attention in the last few years. Numerous filtration and indexing techniques have been created to speed up the solution of the problem. However, previous filters were designed for speeding up pattern matching, or for finding repetitions between two strings or occurring twice in the same string. In this paper, we present an algorithm called Nimbus for filtering strings prior to finding repetitions occurring twice or more in a string, or in two or more strings. Nimbus uses gapped seeds that are indexed with a new data structure, called a bi-factor array, that is also presented in this paper. Experimental results show that the filter can be very efficient: when Nimbus is used to preprocess a data set in which one wants to find functional elements with a multiple local alignment tool such as Glam, the overall execution time can be reduced from 7.5 hours to 2 minutes.
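
As a rough illustration of the notions named in the abstract, the following Python sketch computes the Hamming distance between two equal-length strings and collects gapped seeds that occur at least r times in a text. It is not Nimbus and it is not the bi-factor array: the dictionary-based index is a naive stand-in, and the definition used here (a bi-factor as two k-factors separated by a gap of g arbitrary characters) is an assumption made for this example. The function names `hamming_distance`, `bifactor_index`, and `repeated_bifactors` are hypothetical.

```python
from collections import defaultdict


def hamming_distance(u, v):
    """Number of positions at which two equal-length strings differ."""
    if len(u) != len(v):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(a != b for a, b in zip(u, v))


def bifactor_index(text, k, g):
    """Map each (left k-factor, right k-factor) pair to its start positions.

    Assumption for this sketch: a bi-factor starting at position i is
    text[i:i+k], followed by g ignored characters, followed by another
    k characters. A plain dict stands in for a real index structure.
    """
    index = defaultdict(list)
    span = 2 * k + g  # total length covered by one gapped seed
    for i in range(len(text) - span + 1):
        left = text[i:i + k]
        right = text[i + k + g:i + span]
        index[(left, right)].append(i)
    return dict(index)


def repeated_bifactors(text, k, g, r=2):
    """Keep only the bi-factors that occur at least r times in the text."""
    return {
        seed: positions
        for seed, positions in bifactor_index(text, k, g).items()
        if len(positions) >= r
    }


if __name__ == "__main__":
    text = "ACGTAGGACTTA"
    # The windows at positions 0 and 7 share the same gapped seed
    # even though they differ in the gap character.
    print(repeated_bifactors(text, k=2, g=1))
    print(hamming_distance(text[0:5], text[7:12]))
```

In this toy input, the windows "ACGTA" and "ACTTA" differ only inside the gap, so they share a gapped seed while being at Hamming distance 1. This is the intuition behind using gapped seeds to keep regions that may contain approximate repetitions; the actual filtering conditions used by Nimbus are given in the paper.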

Keywords

Approximate repetitions; k-Factors; Multiple local alignment; Bi-factors; Bi-factor array

Domains

Bioinformatics [q-bio.QM]; Bioinformatics, Systems Biology [q-bio.QM]

Main file

jdaNimbus2.pdf (250.46 KB)

Origin: Files produced by the author(s)

Contributor: Pierre Peterlongo

https://inria.hal.science/inria-00179731

Submitted on: Tuesday, October 23, 2007, 10:43:14

Last modified on: Tuesday, January 23, 2024, 11:38:04

Long-term archiving on: Thursday, September 27, 2012, 13:10:24

Dates and versions

inria-00179731, version 1 (23-10-2007)

Identifiers

HAL Id: inria-00179731, version 1
DOI: 10.1016/j.jda.2007.03.003

Cite

Pierre Peterlongo, Nadia Pisanti, Frédéric Boyer, Alair Pereira Do Lago, Marie-France Sagot. Lossless filter for multiple repetitions with Hamming distance. Journal of Discrete Algorithms, 2008, 6 (3), pp.497-509. ⟨10.1016/j.jda.2007.03.003⟩. ⟨inria-00179731⟩

Collections

EC-PARIS UNIV-RENNES1 CNRS INRIA UNIV-LYON1 INSA-RENNES IRISA IRISA-D7 BIOENVIS INRIA2 UR1-MATH-STIC UR1-UFR-ISTIC UNIV-RENNES LBBE INSA-GROUPE UDL ANR UR1-MATH-NUM
