"Learning Field Compatibilities to Extract Database Records from Unstructured Text" by Michael Wick

Selected Works of Andrew McCallum

Unpublished Paper

Learning Field Compatibilities to Extract Database Records from Unstructured Text

(2006)

Michael Wick
Aron Culotta
Andrew McCallum, University of Massachusetts - Amherst

Abstract

Named-entity recognition systems extract typed entities, such as people, organizations, and locations, from unstructured text. Rather than extract these fields in isolation, we present a record extraction system that clusters fields into records (i.e., database tuples). We construct a probabilistic model of the compatibility of field values, then employ graph partitioning algorithms to partition fields into cohesive records. We also investigate compatibility functions over sets of fields, rather than simply pairs of fields, to examine how greater representational power affects performance. We apply our techniques to the task of extracting contact records from faculty and student homepages, demonstrating a 38% error reduction over baseline approaches.
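
The idea of scoring field compatibility and then partitioning fields into records can be illustrated with a minimal Python sketch. It is not the authors' system: the abstract does not specify the model or the partitioning algorithm, so the sketch assumes a hand-written pairwise score in place of the learned probabilistic model and a greedy agglomerative merge in place of the paper's graph partitioning algorithms. The field data, the `compatibility` heuristic, and the `threshold` parameter are all hypothetical.

```python
from itertools import combinations

# Hypothetical extracted fields: (field_id, field_type, value, token_position).
FIELDS = [
    ("f1", "name", "Ann Lee", 0),
    ("f2", "email", "alee@example.edu", 3),
    ("f3", "phone", "555-0100", 6),
    ("f4", "name", "Bob Ray", 40),
    ("f5", "email", "bray@example.edu", 44),
]


def compatibility(a, b):
    """Toy pairwise score in [-1, 1]; positive suggests the same record.

    Assumption: a stand-in for a learned model. Two names never share a
    record, and fields that appear close together in the text score higher.
    """
    _, type_a, _, pos_a = a
    _, type_b, _, pos_b = b
    if type_a == "name" and type_b == "name":
        return -1.0
    return max(-1.0, 1.0 - abs(pos_a - pos_b) / 10.0)


def cluster_score(c1, c2):
    """Average pairwise compatibility between two clusters of fields."""
    scores = [compatibility(a, b) for a in c1 for b in c2]
    return sum(scores) / len(scores)


def partition(fields, threshold=0.0):
    """Greedily merge the most compatible clusters until none exceeds threshold."""
    clusters = [[f] for f in fields]
    while len(clusters) > 1:
        (i, j), best = max(
            (((i, j), cluster_score(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best <= threshold:
            break
        # i < j always holds, so deleting index j leaves index i valid.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


records = partition(FIELDS)
record_ids = [[field[0] for field in record] for record in records]
print(record_ids)
```

Here `cluster_score` only averages pairwise scores. A compatibility function over sets of fields, as the abstract mentions, would instead score a whole candidate cluster at once, for example by penalizing any cluster that contains two names.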

Disciplines

Computer Sciences

Publication Date

2006

Comments

This is the pre-published version harvested from CIIR.

Citation Information

Michael Wick, Aron Culotta and Andrew McCallum. "Learning Field Compatibilities to Extract Database Records from Unstructured Text" (2006)
Available at: http://works.bepress.com/andrew_mccallum/132/