Highlights from the 11th ISCB Student Council Symposium 2015. Dublin, Ireland. 10 July 2015.
Article
Publication date:
2016
Abstract:
A novel feature selection method to extract multiple adjacent solutions for viral genomic sequences classification
Background
Thanks to improvements in next-generation sequencing technologies, genome sequencing of many samples under different conditions has led to an exponential growth of biological sequence data. However, biologists cannot easily process these collections to obtain a thorough characterization of the data, and doing so demands a large investment of time and money. Computational strategies, and specifically automatic knowledge extraction methods that focus the analysis on which data are meaningful and should be sequenced, are therefore essential [1].
Methods
Here, we present a new feature selection algorithm, based on mixed integer programming methods [2], that extracts multiple adjacent solutions for supervised learning problems on biological data. We focus on problems where the relative position of a feature (i.e., a nucleotide locus) is relevant. In particular, we aim to find sets of distinctive features that are as close as possible to each other and that appear with the same required characteristics. Our algorithm adopts a fast and effective method to evaluate the quality of the extracted feature sets, and it has been successfully integrated into a rule-based classification framework [3].
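To make the idea of adjacent discriminant features concrete, the following is a minimal sketch and not the mixed integer programming formulation described above: it swaps in a plain brute-force search over contiguous windows of aligned sequences. The function names, the two-class setup, and the toy sequences are all hypothetical and chosen only for illustration.

```python
def is_discriminant(seqs_a, seqs_b, start, length):
    """True if no class-A substring in the window equals any class-B substring."""
    window_a = {s[start:start + length] for s in seqs_a}
    window_b = {s[start:start + length] for s in seqs_b}
    return window_a.isdisjoint(window_b)


def shortest_discriminant_windows(seqs_a, seqs_b):
    """Return (length, starts) for all shortest contiguous discriminant windows.

    Assumption: sequences are already aligned and have equal length.
    Brute force stands in for the MIP model; it is only practical on toy data.
    """
    n = len(seqs_a[0])
    assert all(len(s) == n for s in seqs_a + seqs_b), "sequences must be aligned"
    for length in range(1, n + 1):
        starts = [
            i for i in range(n - length + 1)
            if is_discriminant(seqs_a, seqs_b, i, length)
        ]
        if starts:
            # Every start found here is an equivalent, equally short solution.
            return length, starts
    return None, []


# Hypothetical toy data: no single locus separates the classes,
# but two different 2-locus windows do.
class_a = ["ACAC", "CACA"]
class_b = ["AACC", "CCAA"]
print(shortest_discriminant_windows(class_a, class_b))  # -> (2, [0, 2])
```

In this toy case the search returns two alternative windows of the same minimal length, which mirrors the notion of multiple equivalent solutions; a real instance would rely on the optimization model rather than on enumeration.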
Results
Our algorithm has been applied to three viral datasets (i.e., Rhino-, Influenza-, and Polyomaviruses [4-6]). It extracts all the alternative solutions for assigning virus specimens to species by identifying portions of sequence that are discriminant, compact, and as short as possible.
To conclude, we succeeded in extracting a wide set of equivalent classification rules that focus on short regions of the sequences, with high reliability and low computational time. These rules provide biologists with short and highly informative genome parts to be sequenced, as well as a powerful instrument for both scientific and diagnostic use, e.g., for automatic virus detection.
CRIS type:
01.01 Journal article
Keywords:
multiple sequences virus; bioinformatics; virus
Authors:
Weitschek, Emanuel; Fiscon, Giulia; Bertolazzi, Paola; Felici, Giovanni