MTG-Link: filling gaps in draft genome assemblies with linked read data - INRIA - Institut National de Recherche en Informatique et en Automatique
Conference poster, Year: 2020

MTG-Link: filling gaps in draft genome assemblies with linked read data

Abstract

Recent advances in both second- and third-generation sequencing technologies have improved the assembly of most genomes. However, the complete and accurate reconstruction of large genomes of non-model organisms remains challenging. In particular, the scaffolding step orders and orients contigs but leaves undefined sequences between them, called gaps. Linked read technologies, such as the 10X Genomics Chromium platform, have great potential for filling the gaps in draft genomes, as they provide long-range information while maintaining the power and accuracy of short-read sequencing [1][2]. With these technologies, reads sequenced from the same long DNA molecule (around 30-50 kb) can be identified by a short barcode sequence. Several tools have been developed for gap-filling with short or long read data [3][4], but to our knowledge, none uses the long-range information of linked read data. Here, we present MTG-Link, a novel gap-filling tool dedicated to linked read data generated by the 10X Genomics Chromium technology. MTG-Link is a Python pipeline combining the local assembly tool MindTheGap [5] with an efficient read subsampling based on barcode information. For each gap, it extracts the linked reads whose barcode is observed in the gap flanking sequences, and assembles them into contigs by traversing their de Bruijn graph. MTG-Link automatically tests different parameter values for gap-filling, in both forward and reverse orientations, and produces a sequence assembly for each whenever possible. After an automatic qualitative evaluation of the best sequence assembly, it returns a GFA file containing the gap-filled sequence of each gap. To speed up the process, MTG-Link uses a trivial parallelization scheme, assigning each gap to a separate thread.
We validated our approach on a set of simulated gaps from real datasets with various genome complexities, and showed that the read subsampling step of MTG-Link yields better gap assemblies in less CPU time than MindTheGap on its own. We then applied MTG-Link to several individual genomes of a mimetic butterfly (Heliconius numata), where it significantly improved the contiguity of a 1.3 Mb locus of biological interest. MTG-Link is freely available at https://github.com/anne-gcd/MTG-Link.
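The barcode-based read subsampling and the one-thread-per-gap scheme described in the abstract can be illustrated with a minimal Python sketch. This is not MTG-Link's actual code: the toy read tuples, the function names (`barcodes_in_region`, `subsample_reads_for_gap`, `process_gaps`), the `flank_size` parameter, and the choice to take the union of the barcodes seen in both flanks are all assumptions made for illustration. The local assembly step performed by MindTheGap is not shown.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy data: (barcode, sequence, mapped position or None if unmapped).
# Real linked reads would come from barcoded FASTQ/BAM files.
READS = [
    ("BX1", "ACGTAC", 120),    # maps in the left flank of the first gap
    ("BX1", "GTACGG", None),   # unmapped, same molecule as the read above
    ("BX2", "TTGACA", 550),    # maps in the right flank of the first gap
    ("BX2", "ACATTG", None),
    ("BX3", "CCCTGA", 9000),   # maps far from the first gap
    ("BX3", "GATTAC", None),
]


def barcodes_in_region(reads, start, end):
    """Return the barcodes of reads mapped in the half-open interval [start, end)."""
    return {bc for bc, _seq, pos in reads if pos is not None and start <= pos < end}


def subsample_reads_for_gap(reads, gap_start, gap_end, flank_size):
    """Keep every read whose barcode is observed in either flank of the gap.

    Assumption: barcodes from the two flanks are combined with a set union.
    """
    left = barcodes_in_region(reads, gap_start - flank_size, gap_start)
    right = barcodes_in_region(reads, gap_end, gap_end + flank_size)
    selected = left | right
    return [seq for bc, seq, _pos in reads if bc in selected]


def process_gaps(reads, gaps, flank_size, threads=2):
    """Subsample reads for each gap, giving each gap to a worker thread."""
    def work(gap):
        return subsample_reads_for_gap(reads, gap[0], gap[1], flank_size)

    with ThreadPoolExecutor(max_workers=threads) as pool:
        # pool.map preserves the order of the input gaps.
        return dict(zip(gaps, pool.map(work, gaps)))


if __name__ == "__main__":
    gaps = [(200, 500), (8900, 8950)]
    for gap, seqs in process_gaps(READS, gaps, flank_size=100).items():
        print(gap, seqs)
```

In this toy setting, the unmapped reads carrying barcodes BX1 and BX2 are recruited for the first gap even though they have no mapping position, which is the point of using the barcode as long-range information. Each resulting read subset would then be passed to a local assembler.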
File: JOBIM2020_AbstractPoster_AnneGUICHARD.pdf (115.82 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-03073966, version 1 (16-12-2020)

Identifiers

  • HAL Id: hal-03073966, version 1

Cite

Anne Guichard, Fabrice Legeai, Arthur Le Bars, Paul Yann Jay, Mathieu Joron, et al. MTG-Link: filling gaps in draft genome assemblies with linked read data. JOBIM 2020 - Journées Ouvertes Biologie, Informatique et Mathématiques, Jun 2020, Montpellier, France. ⟨hal-03073966⟩