"Large-Scale Cross-Document Coreference Using Distributed Inference and Hierarchical Models" by Sameer Singh

Selected Works of Andrew McCallum

Unpublished Paper

Large-Scale Cross-Document Coreference Using Distributed Inference and Hierarchical Models

(2011)

Sameer Singh
Amarnag Subramanya
Fernando Pereira
Andrew McCallum, University of Massachusetts - Amherst

Abstract

Cross-document coreference, the task of grouping all the mentions of each entity in a document collection, arises in information extraction and automated knowledge base construction. For large collections, it is clearly impractical to consider all possible groupings of mentions into distinct entities. To address this problem, we propose two ideas: (a) a distributed inference technique that uses parallelism to enable large-scale processing, and (b) a hierarchical model of coreference that represents uncertainty over multiple granularities of entities to facilitate more effective approximate inference. To evaluate these ideas, we constructed a labeled corpus with 1.5 million mentions from links to Wikipedia entities available on the web. We show that the combination of the hierarchical model and distributed inference quickly obtains high accuracy (38% error reduction) on this large dataset, demonstrating the scalability of our approach.

Disciplines

Computer Sciences

Publication Date

2011

Comments

This is the pre-published version harvested from CIIR.

Citation Information

Sameer Singh, Amarnag Subramanya, Fernando Pereira and Andrew McCallum. "Large-Scale Cross-Document Coreference Using Distributed Inference and Hierarchical Models" (2011)
Available at: http://works.bepress.com/andrew_mccallum/72/