Fulgor: a fast and compact k-mer index for large-scale matching and color queries

Academic Article
Publication Date:
2024
Abstract:
The problem of sequence identification or matching (determining the subset of reference sequences from a given collection that are likely to contain a short, queried nucleotide sequence) is relevant for many important tasks in Computational Biology, such as metagenomics and pangenome analysis. Due to the complex nature of such analyses and the large scale of the reference collections, a resource-efficient solution to this problem is of utmost importance. This poses the threefold challenge of representing the reference collection with a data structure that is efficient to query, uses little memory, and scales well to large collections. To solve this problem, we describe an efficient colored de Bruijn graph index, arising as the combination of a k-mer dictionary with a compressed inverted index. The proposed index takes full advantage of the fact that unitigs in the colored compacted de Bruijn graph are monochromatic (i.e., all k-mers in a unitig have the same set of references of origin, or color). Specifically, the unitigs are kept in the dictionary in color order, thereby allowing for the encoding of the map from k-mers to their colors in as little as 1 + o(1) bits per unitig. Hence, one color per unitig is stored in the index with almost no space/time overhead. By combining this property with simple but effective compression methods for integer lists, the index occupies very little space. We implement these methods in a tool called Fulgor, and conduct an extensive experimental analysis to demonstrate the improvement of our tool over previous solutions. For example, compared to Themisto, the strongest competitor in terms of index space vs. query time trade-off, Fulgor requires significantly less space (up to 43% less space for a collection of 150,000 Salmonella enterica genomes), is at least twice as fast for color queries, and is 2-6× faster to construct.
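The "1 + o(1) bits per unitig" claim in the abstract can be illustrated with a small sketch. It is not Fulgor's actual code. It assumes that unitigs are numbered in color order, that color ids are consecutive integers starting at 0, and that one bit per unitig marks the last unitig of each color run; the color of a unitig is then the number of set bits before its position (a rank query). The function names are hypothetical, and the plain prefix-sum list stands in for a succinct rank structure, which would answer the same query using only o(n) extra bits.

```python
from itertools import accumulate


def build_color_boundaries(unitig_colors):
    """Return one bit per unitig: 1 iff the unitig is the last of its color run.

    Assumes unitig_colors is sorted and uses consecutive color ids 0, 1, 2, ...
    """
    n = len(unitig_colors)
    return [
        1 if i == n - 1 or unitig_colors[i] != unitig_colors[i + 1] else 0
        for i in range(n)
    ]


def build_rank_table(bits):
    """Prefix sums: rank[i] is the number of 1 bits in bits[0:i].

    Illustration only; a succinct rank index would replace this list.
    """
    return [0] + list(accumulate(bits))


def color_of_unitig(rank, unitig_id):
    """Color id of a unitig = number of color runs that end before it."""
    return rank[unitig_id]


# Six unitigs in color order: three of color 0, one of color 1, two of color 2.
unitig_colors = [0, 0, 0, 1, 2, 2]
bits = build_color_boundaries(unitig_colors)
rank = build_rank_table(bits)
recovered = [color_of_unitig(rank, u) for u in range(len(unitig_colors))]
```

Because every k-mer in a unitig shares the unitig's color, looking up a k-mer's color under these assumptions reduces to finding its unitig id in the dictionary and issuing one rank query; no per-k-mer color pointer is stored.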
Iris type:
01.01 Journal article
Keywords:
k-mers; Colored compacted de Bruijn graph; Compression; Read-mapping
List of contributors:
Pibiri, Giulio Ermanno
Handle:
https://iris.cnr.it/handle/20.500.14243/454752
Full Text:
https://iris.cnr.it//retrieve/handle/20.500.14243/454752/182003/prod_492117-doc_205297.pdf
Published in:
ALGORITHMS FOR MOLECULAR BIOLOGY
Journal
URL:
https://almob.biomedcentral.com/articles/10.1186/s13015-024-00251-9