Jennifer A. Smith

Non-Coding RNA Covariance Model Combination Using Mixed Primary-Secondary Structure Alignment

Tue, 01 Jan 2013 08:00:00 +0000

Covariance models are very effective for finding new members of non-coding RNA sequence families in genomic data. However, the computational burden of applying CM-based search algorithms can be prohibitive. When annotating the genome of a newly sequenced organism, it is usually desirable to search the sequence data using a large number of ncRNA families. The computational burden can be reduced if the families are clustered into statistically similar models and a single cluster-average representative model is produced for each cluster. The database is then searched with the representative model for each cluster at a relatively low detection threshold. The output of this pre-filtering step is then processed with the individual family models of the cluster. A base-pair conflict metric has previously been proposed for use in model clustering. In this work, an alternative metric using standard alignment algorithms and a special mixed primary-secondary structure scoring matrix is proposed.
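
The two-stage search described in this abstract can be illustrated with a minimal sketch. It assumes each model is available as a plain scoring function over a candidate window; the names `two_stage_search`, `representative`, and `family_models` are hypothetical and are not taken from the paper or from any covariance-model software.

```python
from typing import Callable, Dict, Iterable, List

# Assumption: a "model" is represented only by a function that scores a window.
Scorer = Callable[[str], float]


def two_stage_search(
    windows: Iterable[str],
    representative: Scorer,
    rep_threshold: float,
    family_models: Dict[str, Scorer],
    family_threshold: float,
) -> Dict[str, List[str]]:
    """Pre-filter with the cluster representative, then score survivors per family."""
    hits: Dict[str, List[str]] = {name: [] for name in family_models}
    for window in windows:
        # Stage 1: cheap pass with the representative model at a low threshold.
        if representative(window) < rep_threshold:
            continue
        # Stage 2: only surviving windows are scored by each family model.
        for name, scorer in family_models.items():
            if scorer(window) >= family_threshold:
                hits[name].append(window)
    return hits
```

The saving comes from the second loop running only on windows that pass the representative model, so the per-family cost scales with the number of survivors rather than with the database size.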

Joint Loop End Modeling Improves Covariance Model Based Non-Coding RNA Gene Search

Wed, 22 Sep 2010 07:00:00 +0000

The effect of more detailed modeling of the interface between stem and loop in non-coding RNA hairpin structures on efficacy of covariance-model-based non-coding RNA gene search is examined. Currently, the prior probabilities of the two stem nucleotides and two loop-end nucleotides at the interface are treated the same as any other stem and loop nucleotides respectively. Laboratory thermodynamic studies show that hairpin stability is dependent on the identities of these four nucleotides, but this is not taken into account in current covariance models. It is shown that separate estimation of emission priors for these nucleotides and joint treatment of substitution probabilities for the two loop-end nucleotides leads to improved non-coding RNA gene search.

Computational Intelligence Method to Find Generic Non-Coding RNA Search Models

Jennifer A. Smith — Sun, 02 May 2010 07:00:00 +0000

Fairly effective methods exist for finding new non-coding RNA genes using search models based on known families of ncRNA genes (for example, covariance models). However, these models only find new members of the existing families and are not useful in finding potential members of novel ncRNA families. Other problems with family-specific search include large processing requirements, ambiguity in defining which sequences form a family, and a lack of sufficient numbers of known sequences to properly estimate model parameters. An ncRNA search model is proposed which includes a collection of non-overlapping RNA hairpin structure covariance models. The hairpin models are chosen from a hairpin-model list compiled from many families in the Rfam non-coding RNA families database. The specific hairpin models included and the overall score threshold for the search model are determined through the use of a genetic algorithm.

RNA Search with Decision Trees and Partial Covariance Models

Jennifer A. Smith — Wed, 01 Jul 2009 07:00:00 +0000

The use of partial covariance models to search for RNA family members in genomic sequence databases is explored. The partial models are formed from contiguous subranges of the overall RNA family multiple alignment columns. A binary decision-tree framework is presented for choosing the order in which to apply the partial models and the score thresholds on which to base the decisions. The decision trees are chosen to minimize computation time subject to the constraint that all of the training sequences are passed to the full covariance model for final evaluation. Computational intelligence methods are suggested for selecting the decision tree, since the tree can be quite complex and there is no obvious method to build the tree in these cases. Experimental results from seven RNA families show execution times of 0.066-0.268 relative to using the full covariance model alone. Tests on the full sets of known sequences for each family show that at least 95 percent of these sequences are found for two families and 100 percent for five others. Since the full covariance model is run on all sequences accepted by the partial model decision tree, the false alarm rate is at least as low as that of the full model alone.
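
A binary decision tree of partial models feeding a full model, as described in this abstract, might be sketched as follows. This is an illustrative structure only: the `Node` class, the use of plain scoring functions in place of covariance models, and the boolean leaves are all assumptions made for the example, not the representation used in the paper.

```python
from dataclasses import dataclass
from typing import Callable, List, Union

# Assumption: each partial or full model is represented by a scoring function.
Scorer = Callable[[str], float]


@dataclass
class Node:
    partial_model: Scorer
    threshold: float
    if_pass: Union["Node", bool]  # True = hand to full model, False = reject
    if_fail: Union["Node", bool]


def passes_tree(node: Union[Node, bool], seq: str) -> bool:
    """Walk the tree until a boolean leaf is reached."""
    while not isinstance(node, bool):
        if node.partial_model(seq) >= node.threshold:
            node = node.if_pass
        else:
            node = node.if_fail
    return node


def search(seqs: List[str], tree: Node, full_model: Scorer, full_threshold: float) -> List[str]:
    # The full model makes the final call on every sequence the tree accepts,
    # so the tree can only remove candidates, never add false alarms.
    return [s for s in seqs if passes_tree(tree, s) and full_model(s) >= full_threshold]
```

Because `search` applies the full model to everything the tree accepts, the set of reported hits is always a subset of what the full model alone would report, which matches the false-alarm argument in the abstract.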

Integrating Thermodynamic and Observed-Frequency Data for Non-Coding RNA Gene Search

Tue, 23 Dec 2008 08:00:00 +0000

Among the most powerful and commonly used methods for finding new members of non-coding RNA gene families in genomic data are covariance models. The parameters of these models are estimated from the observed position-specific frequencies of insertions, deletions, and mutations in a multiple alignment of known non-coding RNA family members. Since the vast majority of positions in the multiple alignment have no observed changes, yet there is no reason to rule them out, some form of prior is applied to the estimate. Currently, observed-frequency priors are generated from non-family members based on model node type and child node type allowing for some differentiation between priors for loops versus helices and between internal segments of structures and edges of structures. In this work it is shown that parameter estimates might be improved when thermodynamic data is combined with the consensus structure/sequence and observed-frequency priors to create more realistic position-specific priors.
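
The role of a prior when most alignment positions show no observed changes can be illustrated with a generic pseudocount estimate. The sketch below assumes a single-component Dirichlet prior and takes the posterior mean; the function name and the pseudocount values are hypothetical, and the priors actually used in covariance-model software, or the thermodynamics-informed priors proposed here, may be constructed differently.

```python
from typing import Dict


def posterior_mean(counts: Dict[str, float], pseudocounts: Dict[str, float]) -> Dict[str, float]:
    """Posterior mean of emission probabilities under a Dirichlet prior.

    p_k = (c_k + alpha_k) / sum_j (c_j + alpha_j)
    """
    total = sum(counts.get(k, 0.0) + alpha for k, alpha in pseudocounts.items())
    return {k: (counts.get(k, 0.0) + alpha) / total for k, alpha in pseudocounts.items()}


# A column where only "A" was ever observed still gives the other
# nucleotides non-zero probability because of the prior.
uniform_prior = {"A": 0.5, "C": 0.5, "G": 0.5, "U": 0.5}  # illustrative values
probs = posterior_mean({"A": 8}, uniform_prior)
```

Under this view, a position-specific prior amounts to choosing different pseudocounts for different alignment positions instead of one shared set per node type.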

Efficient Non-Coding RNA Gene Searches Through Classical and Evolutionary Methods

Thu, 11 Dec 2008 08:00:00 +0000

Successful non-coding RNA gene searching requires examination of long-range intramolecular base pairing possibilities. This results in search algorithms with extremely long run times such that large-scale use of the algorithms often becomes computationally infeasible. Methods for the efficient search of the solution space are examined. A review of the standard dynamic-programming covariance model search algorithm is given. An analysis of the statistically probable regions of the search space is undertaken and a method of limiting the traditional dynamic-programming algorithm to this region is shown. An alternative search method using a Genetic Algorithm (GA) which favours the probable region of the search space is also given.

Efficient Non-Coding RNA Gene Searches through Classical and Evolutionary Methods

Jennifer A. Smith — Thu, 11 Dec 2008 00:00:00 +0000

Improved Covariance Model Parameter Estimation Using RNA Thermodynamic Properties

Sat, 01 Dec 2007 08:00:00 +0000

Covariance models are a powerful description of non-coding RNA (ncRNA) families that can be used to search nucleotide databases for new members of these ncRNA families. Currently, estimation of the parameters of a covariance model (state transition and emission scores) is based only on the observed frequencies of mutations, insertions, and deletions in known ncRNA sequences. For families with very few known members, this can result in rather uninformative models where the consensus sequence has a good score and most deviations from consensus have a fairly uniform poor score. It is proposed here to combine the traditional observed-frequency information with known information about free energy changes in RNA helix formation and loop length changes. More thermodynamically probable deviations from the consensus sequence will then be favored in database search. The thermodynamic information may be incorporated into the models as informative priors that depend on neighboring consensus nucleotides and on loop lengths.

RNA Gene Finding with Biased Mutation Operators

Sun, 01 Apr 2007 07:00:00 +0000

The use of genetic algorithms for non-coding RNA gene finding has previously been investigated and found to be a potentially viable method for accelerating covariance-model-based database search relative to full dynamic-programming methods. The mutation operators in previous work chose new alignment insertion and deletion locations uniformly over the length of the model consensus sequence. Since the covariance models are estimated from multiple known members of a non-coding RNA family, information is available as to the likelihood of insertions or deletions at the individual model positions. This information is implicit in the state-transition parameters of the estimated covariance models. In the current work, the use of mutation operators which are biased toward selection of insertions and deletions at model positions with low insertion or deletion penalties is examined in hopes of speeding up convergence. The performance of the biased and unbiased mutation operators is compared. Both biased and unbiased genetic algorithms are also compared to a steepest-descent algorithm, which is a comparison lacking in prior work.
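
The difference between the unbiased and biased mutation operators can be sketched as a choice of sampling distribution over model positions. The exponential weighting and the `temperature` parameter below are assumptions made for illustration; the abstract states only that selection is biased toward positions with low insertion or deletion penalties, not how the bias is computed.

```python
import math
import random
from typing import List


def position_weights(penalties: List[float], temperature: float = 1.0) -> List[float]:
    """Map per-position indel penalties to sampling probabilities (lower penalty, higher weight)."""
    weights = [math.exp(-p / temperature) for p in penalties]
    total = sum(weights)
    return [w / total for w in weights]


def pick_indel_position(penalties: List[float], rng: random.Random, biased: bool = True) -> int:
    """Choose the model position at which a mutation places an insertion or deletion."""
    if not biased:
        # Unbiased operator: uniform over the consensus length.
        return rng.randrange(len(penalties))
    # Biased operator: favour positions where the model penalises indels least.
    return rng.choices(range(len(penalties)), weights=position_weights(penalties), k=1)[0]
```

Passing a seeded `random.Random` instance keeps runs reproducible, which helps when comparing convergence of the two operators.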

Covariance Searches for ncRNA Gene Finding

Fri, 01 Sep 2006 07:00:00 +0000

The use of covariance models for non-coding RNA gene finding is extremely powerful and also extremely computationally demanding. A major reason for the high computational burden of this algorithm is that the search proceeds through every possible start position in the database and every possible sequence length between zero and a user-defined maximum length at every one of these start positions. Furthermore, for every start position and sequence length, all possible combinations of insertions and deletions leading to the given sequence length are searched. It has been previously shown that a large portion of this search space is nowhere near any database match observed in practice and that the search space can be limited significantly with little change in expected search results. In this work a different approach is taken in which the space of starting positions, sequence lengths, and insertion/deletion patterns is searched using a genetic algorithm.
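
The size of the space of start positions and sequence lengths mentioned in this abstract can be made concrete with a small count. The function below is a hypothetical illustration that counts only (start, length) pairs; the insertion/deletion patterns within each pair enlarge the space further and are not counted here.

```python
def count_windows(db_len: int, max_len: int) -> int:
    """Count (start, length) pairs with 0 <= length <= max_len that fit in the database."""
    return sum(min(max_len, db_len - start) + 1 for start in range(db_len))
```

Away from the end of the database each start position contributes `max_len + 1` candidate lengths, so the count grows roughly as the product of database length and maximum sequence length.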