Publication Date:
2014
Abstract:
Background: Next Generation Sequencing (NGS) machines extract a large number of short DNA fragments (reads)
from a biological sample. These reads are then used for several applications, e.g., sequence reconstruction, DNA
assembly, gene expression profiling, and mutation analysis.
Methods: We propose a method to evaluate the similarity between reads. This method does not rely on the
alignment of the reads; it is based on the distance between the frequencies of their fixed-length substrings
(k-mers). We compare this alignment-free distance with the similarity measures derived from two
alignment methods: Needleman-Wunsch and BLAST. The comparison rests on a simple assumption: the most
accurate distance is obtained when the reference sequence is known in advance. Therefore, we first align the reads to
the original DNA sequence, compute the overlap between the aligned reads, and use this overlap as an ideal
distance. We then verify how well the alignment-free and the alignment-based distances reproduce this ideal distance.
The ability to reproduce the ideal distance correctly is evaluated over samples of read pairs from Saccharomyces
cerevisiae, Escherichia coli, and Homo sapiens. The comparison is based on the correctness of threshold predictors
cross-validated over different samples.
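
The k-mer frequency comparison described in the Methods can be illustrated with a minimal Python sketch. The abstract does not state which distance function or which value of k is used, so the Euclidean distance and the default k=3 below are assumptions, and the function names kmer_frequencies and kmer_distance are hypothetical.

```python
from collections import Counter
from math import sqrt


def kmer_frequencies(read, k):
    """Return the relative frequency of each k-mer (length-k substring) in a read."""
    total = len(read) - k + 1
    if total <= 0:
        raise ValueError("read is shorter than k")
    counts = Counter(read[i:i + k] for i in range(total))
    return {kmer: count / total for kmer, count in counts.items()}


def kmer_distance(read_a, read_b, k=3):
    """Alignment-free distance between two reads.

    Assumption: Euclidean distance between the k-mer frequency vectors;
    the abstract does not name the distance function.
    """
    freq_a = kmer_frequencies(read_a, k)
    freq_b = kmer_frequencies(read_b, k)
    all_kmers = freq_a.keys() | freq_b.keys()
    return sqrt(sum((freq_a.get(m, 0.0) - freq_b.get(m, 0.0)) ** 2
                    for m in all_kmers))
```

Because each read is reduced to a frequency vector in a single pass, the cost of comparing two reads grows linearly with their length, whereas a Needleman-Wunsch alignment fills a table whose size is the product of the two read lengths.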
Results: We present experimental evidence that the proposed alignment-free distance is a potentially useful
read-to-read distance measure and performs better than the more time-consuming distances based on alignment.
Conclusions: Alignment-free distances may be used effectively for read comparison, and may provide a significant
speed-up in several NGS-based processes (e.g., DNA assembly, read classification).
Iris type:
01.01 Journal article
Keywords:
Sequence analysis; Next generation sequencing; Alignment-free
List of contributors:
Weitschek, Emanuel; De Cola, Maria; Fiscon, Giulia; Bertolazzi, Paola; Felici, Giovanni; Santoni, Daniele
Published in: