Skip to Main Content (Press Enter)

Logo CNR
  • ×
  • Home
  • People
  • Outputs
  • Organizations
  • Expertise & Skills

UNI-FIND
Logo CNR

|

UNI-FIND

cnr.it
  • ×
  • Home
  • People
  • Outputs
  • Organizations
  • Expertise & Skills
  1. Outputs

Spectrum preserving tilings enable sparse and modular reference indexing

Conference Paper
Publication Date:
2023
abstract:
The reference indexing problem for -mers is to pre-process a collection of reference genomic sequences so that the position of all occurrences of any queried -mer can be rapidly identified. An efficient and scalable solution to this problem is fundamental for many tasks in bioinformatics. In this work, we introduce the spectrum preserving tiling (SPT), a general representation of that specifies how a set of tiles repeatedly occur to spell out the constituent reference sequences in. By encoding the order and positions where tiles occur, SPTs enable the implementation and analysis of a general class of modular indexes. An index over an SPT decomposes the reference indexing problem for -mers into: (1) a -mer-to-tile mapping; and (2) a tile-to-occurrence mapping. Recently introduced work to construct and compactly index -mer sets can be used to efficiently implement the -mer-to-tile mapping. However, implementing the tile-to-occurrence mapping remains prohibitively costly in terms of space. As reference collections become large, the space requirements of the tile-to-occurrence mapping dominates that of the -mer-to-tile mapping since the former depends on the amount of total sequence while the latter depends on the number of unique -mers in. To address this, we introduce a class of sampling schemes for SPTs that trade off speed to reduce the size of the tile-to-reference mapping. We implement a practical index with these sampling schemes in the tool pufferfish2. When indexing over 30,000 bacterial genomes, pufferfish2 reduces the size of the tile-to-occurrence mapping from 86.3 GB to 34.6 GB while incurring only a 3.6 slowdown when querying -mers from a sequenced readset. Availability: pufferfish2 is implemented in Rust and available at https://github.com/COMBINE-lab/pufferfish2.
Iris type:
04.01 Contributo in Atti di convegno
Keywords:
Reference indexing; Spectrum preserving tilings; Minimal perfect hashing; Pufferfish2
List of contributors:
Pibiri, GIULIO ERMANNO
Handle:
https://iris.cnr.it/handle/20.500.14243/462235
Book title:
Research in Computational Molecular Biology
  • Overview

Overview

URL

https://link.springer.com/chapter/10.1007/978-3-031-29119-7_2
  • Use of cookies

Powered by VIVO | Designed by Cineca | 26.5.0.0 | Sorgente dati: PREPROD (Ribaltamento disabilitato)