"Learning Extractors from Unlabeled Text using Relevant Databases" by Kedar Bellare

Selected Works of Andrew McCallum

Unpublished Paper

Learning Extractors from Unlabeled Text using Relevant Databases

(2007)

Kedar Bellare
Andrew McCallum, University of Massachusetts - Amherst

Abstract

Supervised machine learning algorithms for information extraction generally require large amounts of training data. In many cases where labeling training data is burdensome, there may, however, already exist an incomplete database relevant to the task at hand. Records from this database can be used to label text strings that express the same information. For tasks where text strings do not follow the same format or layout, and additionally may contain extra information, labeling the strings completely may be problematic. This paper presents a method for training extractors which fill in missing labels of a text sequence that is partially labeled using simple high-precision heuristics. Furthermore, we improve the algorithm by utilizing labeled fields from the database. In experiments with BibTeX records and research paper citation strings, we show a significant improvement in extraction accuracy over a baseline that only relies on the database for training data.

Disciplines

Computer Sciences

Publication Date

2007

Comments

This is the pre-published version harvested from CIIR.

Citation Information

Kedar Bellare and Andrew McCallum. "Learning Extractors from Unlabeled Text using Relevant Databases" (2007)
Available at: http://works.bepress.com/andrew_mccallum/99/