Publication date:
2014
Abstract:
Motivation: Protein architectures form a complex multilayered hierarchy. The primary linear sequence of amino acid residues arranges itself in 3-dimensional space so as to form local structures (secondary and super-secondary structures), and the hierarchy extends up to fully functional folded proteins (tertiary and quaternary structures) with their functional characterization. For a majority of proteins only the primary AA sequence is known reliably, while the most valuable characterization in structural and/or functional terms is routinely attained with the use of prediction tools that try to find matching homologous proteins within databases of validated structural/functional hierarchies (e.g. SCOP, CATH). As remarked in [Simossis and Heringa 2006], at the moment no systematic analysis has been done on how incorporating repetitive features of the primary sequence might help in improving the alignment quality of homologous protein (and protein family) matching. Here we report initial findings in the direction of repeatome-based profiling of protein families, with the aim of improving current alignment/matching technologies and classification methods.
Methods: PTRStalker [Pellegrini et al. 2012] is an algorithm designed to detect Fuzzy Tandem Repeats (FTR) in protein sequences (20 AA alphabet).
Using PTRStalker as a black box, we compute an FTR-profile for a protein P by (a) detecting the set FTR(P) of FTR in P, (b) computing the mean FTR length over ten random shufflings of P, and (c) removing from FTR(P) all TR of length smaller than the mean computed at (b). The statistically filtered FTR are then turned into a vector of features that includes the length of P, the lengths of all FTR after the statistical filtering in order of appearance along the protein, and the features of the background random-shuffling FTR distribution (mean and max values). This FTR-descriptor for the protein P can be used in different ways. In the next section we report good performance of this descriptor in a direct characterization of structured and unstructured proteins. We have also used this descriptor together with the Euclidean metric to perform unsupervised learning (clustering) of SCOP protein families, obtaining highly homogeneous clusters. As a next step we plan to apply this new protein descriptor in conjunction with other descriptors (primary sequence, secondary structure, etc.) in the framework of Chung and Yona 2004, in order to improve the prediction of distant homologies among protein families by augmenting family profiles with FTR descriptors.
Results: We have tested three validated data sets. The first data set (DS1) is a collection of 92 sequences covering 54037 residues from [Walsh et al. 2012], corresponding to 18725 residues of validated secondary structures (mostly solenoids). This benchmark is intended to measure the capability of PTRStalker in detecting existing secondary structures. After statistical filtering PTRStalker returns 95 Fuzzy Tandem Repeats, of which 67 overlap known SS in DS1. In terms of residue counts the reported FTR cover 17544 residues, of which 11594 cover known SS in DS1 (recall: 0.62, precision: 0.66). The second data set (DS2) is a collection of 105 proteins from the database DisProt classified as 100% disordered.
The rationale of the experiment is that disordered proteins should be relatively free of long tandem repeats. We split the data into three groups of 35 proteins each, of length range [45-110], [111-208], and [>209], and in each class we tested the hypothesis that the disordered proteins of that length class have FTR statistically equivalent to those of randomly shuffled proteins. The Wilcoxon signed-rank test on the length of the longest FTR found in each of the three classes gives, respectively: 0.199, 0.135 and 0.008. This result implies that such unstructured proteins are indeed free of significant FTR at least
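The FTR-profile steps (a)-(c) described in the Methods can be illustrated with a minimal Python sketch. PTRStalker itself is not reproduced here: `toy_detector` is a hypothetical stand-in that only reports runs of one repeated residue, and the names `ftr_descriptor`, `detect`, `n_shuffles` and `rng`, the pooling of repeat lengths across shuffles, and the ordering of the features in the output vector are all assumptions made for illustration.

```python
import random
from statistics import mean
from typing import Callable, List, Optional, Tuple

Repeat = Tuple[int, int]  # (start position, length) of a repeat region


def toy_detector(seq: str) -> List[Repeat]:
    """Hypothetical stand-in for PTRStalker (NOT the real algorithm):
    reports maximal runs of a single repeated residue of length >= 2."""
    repeats: List[Repeat] = []
    i = 0
    while i < len(seq):
        j = i
        while j + 1 < len(seq) and seq[j + 1] == seq[i]:
            j += 1
        if j > i:
            repeats.append((i, j - i + 1))
        i = j + 1
    return repeats


def ftr_descriptor(
    seq: str,
    detect: Callable[[str], List[Repeat]] = toy_detector,
    n_shuffles: int = 10,
    rng: Optional[random.Random] = None,
) -> list:
    """Build a feature vector: [len(P), kept repeat lengths..., bg mean, bg max]."""
    rng = rng or random.Random(0)

    # Step (a): detect repeats in the original sequence.
    found = detect(seq)

    # Step (b): collect repeat lengths over random shufflings of the sequence
    # (assumption: lengths from all shuffles are pooled before averaging).
    background: List[int] = []
    for _ in range(n_shuffles):
        residues = list(seq)
        rng.shuffle(residues)
        background.extend(length for _, length in detect("".join(residues)))
    bg_mean = mean(background) if background else 0.0
    bg_max = max(background) if background else 0

    # Step (c): drop repeats shorter than the background mean,
    # keeping the survivors in order of appearance along the sequence.
    kept = [length for _, length in sorted(found) if length >= bg_mean]

    return [len(seq), *kept, bg_mean, bg_max]


if __name__ == "__main__":
    protein = "ACD" + "A" * 8 + "EFGHKKLM"  # made-up sequence, not real data
    print(ftr_descriptor(protein))
```

Passing the detector as a parameter mirrors the black-box use of PTRStalker described above: any function that maps a sequence to a list of (start, length) pairs could be substituted for the toy detector.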
CRIS type:
04.02 Abstract in conference proceedings
Keywords:
Protein architectures; Biology; algorithm
Author list:
Genovese, Loredana Marialuisa; Pellegrini, Marco; Geraci, Filippo