Probabilistic topic modeling for the analysis and classification of genomic sequences

Article
Publication date:
2015
Abstract:
BACKGROUND: Studies on genomic sequences for classification and taxonomic identification play a leading role in the biomedical field and in the analysis of biodiversity. These studies focus on the so-called barcode genes, which represent a well-defined region of the whole genome. Recently, alignment-free techniques have been gaining importance because they can overcome the drawbacks of sequence alignment techniques. This paper proposes a new alignment-free method for clustering and classifying DNA sequences. The method is based on k-mer representation and text mining techniques.
METHODS: The presented method is based on probabilistic topic modeling, a statistical technique originally proposed for text documents. Probabilistic topic models can find, in a document corpus, the topics (recurrent themes) that characterize classes of documents. Applied to DNA sequences, which play the role of documents, this technique exploits the frequency of fixed-length k-mers and builds a generative model for a training group of sequences. This generative model, obtained through the Latent Dirichlet Allocation (LDA) algorithm, is then used to classify a large set of genomic sequences.
RESULTS AND CONCLUSIONS: We classified over 7000 16S DNA barcode sequences taken from the Ribosomal Database Project (RDP) repository, training probabilistic topic models. The proposed method is compared to the RDP tool and the Support Vector Machine (SVM) classification algorithm in an extensive set of trials using both complete sequences and short sequence snippets (from 400 bp to 25 bp). Our method achieves results very similar to those of the RDP classifier and SVM for complete sequences. The most interesting results are obtained with short sequence snippets: under these conditions the proposed method outperforms RDP and SVM on ultra-short sequences, and its performance decreases smoothly, at every taxonomic level, as the sequence length decreases.
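To make the idea in the abstract concrete, the following is a minimal sketch, not the authors' implementation, of how DNA sequences can be treated as "documents" made of k-mer "words" and fed to an LDA model. The function names (`kmers`, `kmer_counts`, `lda_gibbs`), the choice of k, the hyperparameters `alpha` and `beta`, and the use of a toy collapsed Gibbs sampler are all assumptions made for illustration; this record does not state which k, which inference procedure, or which classification rule the paper uses.

```python
import random
from collections import Counter


def kmers(seq, k):
    """Return the overlapping k-mers ("words") of a DNA sequence, in order."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmer_counts(seq, k):
    """Bag-of-k-mers representation: how often each k-mer occurs in seq."""
    return Counter(kmers(seq, k))


def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA (illustrative, hypothetical helper).

    docs is a list of token lists (here: k-mers per sequence).
    Returns the estimated topic proportions for each document.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    index = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]             # tokens per (doc, topic)
    topic_word = [[0] * n_words for _ in range(n_topics)]  # tokens per (topic, word)
    topic_total = [0] * n_topics                           # tokens per topic

    # Random initial topic assignment for every token.
    assign = []
    for d, doc in enumerate(docs):
        row = []
        for w in doc:
            t = rng.randrange(n_topics)
            row.append(t)
            doc_topic[d][t] += 1
            topic_word[t][index[w]] += 1
            topic_total[t] += 1
        assign.append(row)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wi = index[w]
                t = assign[d][i]
                # Remove the token, then resample its topic from the conditional.
                doc_topic[d][t] -= 1
                topic_word[t][wi] -= 1
                topic_total[t] -= 1
                weights = [
                    (doc_topic[d][j] + alpha)
                    * (topic_word[j][wi] + beta)
                    / (topic_total[j] + n_words * beta)
                    for j in range(n_topics)
                ]
                t = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = t
                doc_topic[d][t] += 1
                topic_word[t][wi] += 1
                topic_total[t] += 1

    # Smoothed topic proportions per document; each row sums to 1.
    return [
        [(doc_topic[d][j] + alpha) / (len(doc) + n_topics * alpha)
         for j in range(n_topics)]
        for d, doc in enumerate(docs)
    ]


if __name__ == "__main__":
    # Made-up toy sequences, not real 16S barcode data.
    sequences = ["ACGTACGTACGTACGT", "ACGTACGAACGTACGT", "TTGGTTGGTTGGTTGG"]
    docs = [kmers(s, 4) for s in sequences]
    for s, theta in zip(sequences, lda_gibbs(docs, n_topics=2)):
        print(s, [round(p, 3) for p in theta])
```

In a classification setting, the per-sequence topic proportions returned here could serve as features for assigning a taxonomic label, for example by comparing a query sequence's topic mixture with those of labeled training sequences. That last step is only one plausible reading of the abstract and is left out of the sketch.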
CRIS type:
01.01 Journal article
Keywords:
Probabilistic topic model; ultra short sequence classification; LDA
Author list:
Rizzo, Riccardo; Urso, Alfonso; Fiannaca, Antonino; La Rosa, Massimo
Institutional authors:
FIANNACA ANTONINO
LA ROSA MASSIMO
RIZZO RICCARDO
URSO ALFONSO
Link to full record:
https://iris.cnr.it/handle/20.500.14243/300849
Published in:
BMC BIOINFORMATICS
Journal
General data

URL

http://www.biomedcentral.com/1471-2105/16/S6/S2