Unpublished Paper
Learning Extractors from Unlabeled Text using Relevant Databases
  • Kedar Bellare
  • Andrew McCallum, University of Massachusetts - Amherst
Supervised machine learning algorithms for information extraction generally require large amounts of training data. Where labeling training data is burdensome, however, an incomplete database relevant to the task at hand may already exist. Records from this database can be used to label text strings that express the same information. When the text strings do not follow the same format or layout as the records, or contain extra information, labeling the strings completely can be problematic. This paper presents a method for training extractors that fill in the missing labels of a text sequence that has been partially labeled using simple high-precision heuristics. We further improve the algorithm by using labeled fields from the database. In experiments with BibTeX records and research paper citation strings, we show a significant improvement in extraction accuracy over a baseline that relies only on the database for training data.
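To make the partial-labeling idea concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes one possible high-precision heuristic: a citation token receives a field label only when the full field value from a database record matches the citation text exactly and in exactly one place. The function names `tokenize` and `partial_label`, the record layout, and the sample citation string are all hypothetical and chosen only for illustration. Tokens left as `None` are the missing labels that a trained extractor would later fill in.

```python
import re


def tokenize(text):
    # Lowercase word tokens plus individual punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text.lower())


def partial_label(citation, record):
    """Label only tokens covered by an exact, unambiguous field match.

    Assumed heuristic (illustrative only): a field value must occur as a
    contiguous token sequence exactly once in the citation string.
    """
    tokens = tokenize(citation)
    labels = [None] * len(tokens)
    for field, value in record.items():
        field_tokens = tokenize(value)
        n = len(field_tokens)
        if n == 0:
            continue
        starts = [
            i for i in range(len(tokens) - n + 1)
            if tokens[i:i + n] == field_tokens
        ]
        # Skip ambiguous matches and matches overlapping earlier labels.
        if len(starts) != 1:
            continue
        s = starts[0]
        if any(label is not None for label in labels[s:s + n]):
            continue
        labels[s:s + n] = [field] * n
    return list(zip(tokens, labels))


# Hypothetical database record and a made-up citation string.
record = {
    "author": "Kedar Bellare and Andrew McCallum",
    "title": "Learning Extractors from Unlabeled Text using Relevant Databases",
    "year": "2007",
}
citation = ("K. Bellare and A. McCallum. Learning extractors from unlabeled "
            "text using relevant databases. Unpublished, 2007.")

labeled = partial_label(citation, record)
for token, label in labeled:
    print(f"{token:12s} {label}")
```

In this sketch the title and year tokens are labeled, while the author tokens stay unlabeled because the abbreviated names in the citation do not match the record's format. That gap is the situation the abstract describes: the string is only partially labeled, and the remaining labels must be inferred.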
This is the pre-published version harvested from CIIR.
Citation Information
Kedar Bellare and Andrew McCallum. "Learning Extractors from Unlabeled Text using Relevant Databases" (2007)
Available at: