Unpublished Paper
Gene Prediction with Conditional Random Fields
(2005)
  • Aron Culotta
  • David Kulp
  • Andrew McCallum, University of Massachusetts - Amherst
Abstract
Given a sequence of DNA nucleotide bases, the task of gene prediction is to find subsequences of bases that encode proteins. Reasonable performance on this task has been achieved using generatively trained sequence models, such as hidden Markov models (HMMs). We propose instead the use of a discriminatively trained sequence model, the conditional random field (CRF). CRFs can naturally incorporate arbitrary, non-independent features of the input without making conditional independence assumptions among the features. This can be particularly important for gene finding, where including evidence from protein databases, EST data, or tiling arrays may improve accuracy. We evaluate our model on human genomic data and show that CRFs perform better than HMM-based models at incorporating homology evidence from protein databases, achieving a 10% reduction in base-level errors.
Publication Date
2005
Comments
This is the pre-published version harvested from CIIR.
Citation Information
Aron Culotta, David Kulp and Andrew McCallum. "Gene Prediction with Conditional Random Fields" (2005)
Available at: http://works.bepress.com/andrew_mccallum/143/