"Resource-bounded Information Extraction: Acquiring Missing Feature Values On Demand" by Pallika Kanani



Unpublished Paper

Resource-bounded Information Extraction: Acquiring Missing Feature Values On Demand

(2010)

Pallika Kanani
Andrew McCallum, University of Massachusetts - Amherst
Shaohan Hu


Abstract

We present a general framework for extracting specific information "on demand" from a large corpus such as the Web under resource constraints. Given a database with missing or uncertain information, the proposed system automatically formulates queries, issues them to a search interface, selects a subset of the resulting documents, extracts the required information from them, and fills in the missing values in the original database. We also exploit inherent dependencies within the data to obtain useful information with fewer computational resources. We build such a system in the citation database domain; it extracts missing publication years from the Web using limited resources. We discuss a probabilistic approach to this task and present first results. The main contribution of this paper is a general, comprehensive architecture for designing such a system, adaptable to different domains.
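
The pipeline described in the abstract (formulate a query, search, select a subset of documents, extract, fill) can be illustrated with a minimal Python sketch. This is not the authors' system. The function names, the record layout, the query template, the per-document cost model, and the `toy_search` stand-in for a search interface are all assumptions made for illustration. A simple majority vote over extracted years replaces the probabilistic approach the paper discusses, and the sketch does not model dependencies within the data.

```python
import re
from collections import Counter

# Four-digit years starting with 19 or 20 (an assumed, deliberately crude extractor).
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")


def formulate_query(record):
    # Hypothetical query template: quoted title plus the first author's surname.
    surname = record["authors"][0].split()[-1]
    return f'"{record["title"]}" {surname}'


def extract_years(text):
    # Return every year-like token found in one document.
    return YEAR_RE.findall(text)


def fill_missing_years(records, search, budget, docs_per_query=2):
    """Fill record["year"] where it is None.

    Assumed cost model: each processed document costs one unit, and at most
    `budget` units are spent overall. Returns the number of units spent.
    """
    spent = 0
    for record in records:
        if record.get("year") is not None:
            continue  # nothing to acquire for this record
        if spent >= budget:
            break  # resources exhausted; remaining gaps stay unfilled
        docs = search(formulate_query(record))
        # Select a subset: the top-ranked documents that still fit the budget.
        chosen = docs[: min(docs_per_query, budget - spent)]
        spent += len(chosen)
        votes = Counter(y for doc in chosen for y in extract_years(doc))
        if votes:
            # Majority vote stands in for a probabilistic model.
            record["year"] = int(votes.most_common(1)[0][0])
    return spent


def toy_search(query):
    # Stand-in for a real search interface; returns canned snippets.
    if "Example Paper A" in query:
        return [
            "Example Paper A. In Proceedings, 2004.",
            "Doe (2004). Example Paper A.",
            "Cited in a 2007 survey.",
        ]
    return []
```

Calling `fill_missing_years(records, toy_search, budget=2)` on a list of dictionaries with `title`, `authors`, and `year` keys fills what it can within the budget and leaves the remaining records untouched. Processing records in list order is the simplest possible policy; a resource-bounded system would more plausibly rank records and documents by expected benefit before spending budget on them.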

Disciplines

Computer Sciences

Publication Date

2010

Comments

This is the pre-published version harvested from CIIR.

Citation Information

Pallika Kanani, Andrew McCallum and Shaohan Hu. "Resource-bounded Information Extraction: Acquiring Missing Feature Values On Demand" (2010)
Available at: http://works.bepress.com/andrew_mccallum/79/