Supervised Detection of Conserved Motifs in DNA Sequences with cosmo
Abstract
Identification of transcription factor binding sites is a major interest in contemporary biological research. A number of computational methods have been proposed to identify these regulatory motifs from a set of unaligned sequences that are thought to share the motif in question. Keles et al. (2003) introduced an algorithm called COMODE that allows this search to be supervised by specifying a set of constraints that the position weight matrix of the unknown motif must satisfy. Such constraints may be formulated, for example, on the basis of prior knowledge about the structure of the transcription factor in question.
Here we present a new implementation of this algorithm, called cosmo, available as a stand-alone program as well as a web application at http://cosmoweb.berkeley.edu, that is considerably faster and more user-friendly than COMODE. By basing the estimation of the intensity parameters in the ZOOPS and TCM models on the E-value of the resulting multiple alignment rather than on the likelihood function, cosmo is made competitive with MEME even in the absence of correctly specified constraints. The E-value furthermore replaces likelihood-based cross-validation as the model selection criterion for choosing the unknown motif width. To select an appropriate constraint set from a collection of candidate constraint sets, cosmo uses cross-validation based on the Euclidean norm between two position weight matrices rather than likelihood-based cross-validation. The model type (OOPS, ZOOPS, or TCM) can now also be chosen data-adaptively based on the E-value criterion rather than having to be specified a priori. These choices of different model selection techniques for different problems are based on extensive simulation studies and underline the notion that likelihood-based cross-validation is aimed specifically at density estimation but may not be optimal for estimating a lower-dimensional functional of that density.
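As a rough illustration of the distance underlying the constraint-set selection, the following sketch computes the Euclidean norm of the difference between two position weight matrices of equal dimensions. It is not taken from cosmo: the function name `pwm_distance`, the row-per-nucleotide layout, and the example matrices are assumptions made for this example, and the abstract does not specify how cosmo forms the two matrices being compared within its cross-validation scheme.

```python
import math


def pwm_distance(pwm_a, pwm_b):
    """Euclidean norm of the entrywise difference between two PWMs.

    Each PWM is a list of rows (assumed here: one row per nucleotide
    A, C, G, T), and each row holds one probability per motif position.
    """
    same_shape = len(pwm_a) == len(pwm_b) and all(
        len(row_a) == len(row_b) for row_a, row_b in zip(pwm_a, pwm_b)
    )
    if not same_shape:
        raise ValueError("PWMs must have identical dimensions")
    squared = sum(
        (x - y) ** 2
        for row_a, row_b in zip(pwm_a, pwm_b)
        for x, y in zip(row_a, row_b)
    )
    return math.sqrt(squared)


# Hypothetical width-2 motifs: the first prefers "AC", the second "CA".
pwm_ac = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
pwm_ca = [[0.0, 1.0], [1.0, 0.0], [0.0, 0.0], [0.0, 0.0]]

distance = pwm_distance(pwm_ac, pwm_ca)
```

A smaller distance indicates that the two matrices describe more similar motifs; identical matrices give a distance of zero.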
We illustrate that correctly specified constraints can lead to considerably improved performance in situations in which the motif appears only as a weak signal in the data. At the same time, we demonstrate that the algorithm can data-adaptively choose between working in a given constrained model and the completely unconstrained model, protecting the user from the risk of mis-specifying the constraint set.
Suggested Citation
Oliver Bembom, Sunduz Keles, and Mark J. van der Laan. "Supervised Detection of Conserved Motifs in DNA Sequences with cosmo" 2006
Available at: http://works.bepress.com/mark_van_der_laan/160