Unpublished Paper
Author Disambiguation using Error-driven Machine Learning with a Ranking Loss Function
(2007)
  • Aron Culotta
  • Pallika Kanani
  • Robert Hall
  • Michael Wick
  • Andrew McCallum, University of Massachusetts - Amherst
Abstract
Author disambiguation is the problem of determining whether records in a publications database that contain similar author names refer to the same person. This task can be especially difficult when the database is constructed from automatically extracted data, which can contain noisy and incomplete records. A common supervised machine learning approach to author disambiguation is to build a classifier that predicts whether a pair of records is coreferent, often followed by a collective inference step to enforce transitivity of the predictions. By restricting the classifier to pairwise predictions, standard training algorithms for binary classification can be used. However, this approach ignores powerful evidence that can be obtained by examining sets (rather than pairs) of records, such as the number of publications or co-authors an author has. In this paper we propose a representation that enables these first-order features over sets of records. We also propose a training algorithm well-suited to this representation that is (1) error-driven in that training examples are generated from incorrect predictions on the training data, and (2) rank-based in that the classifier induces a ranking over candidate predictions. We evaluate our algorithms on three author disambiguation datasets and demonstrate error reductions of up to 60% over the standard binary classification approach.
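The contrast the abstract draws between pairwise features and features over sets of records, together with an error-driven ranking update, can be illustrated with a minimal sketch. It is not the authors' implementation: the record fields (`author`, `coauthors`), the feature names, and the perceptron-style update rule are all assumptions chosen for illustration, since the abstract does not specify the representation or the loss.

```python
from itertools import combinations

# Hypothetical record format (an assumption, not from the paper):
# {"author": "J. Smith", "coauthors": {"A. Lee", ...}}

def pair_features(a, b):
    """Features a pairwise classifier can see: exactly two records."""
    return {
        "same_last_name": float(a["author"].split()[-1] == b["author"].split()[-1]),
        "shared_coauthor": float(bool(a["coauthors"] & b["coauthors"])),
    }

def set_features(cluster):
    """Features over a whole candidate cluster (a set of records)."""
    pairs = list(combinations(cluster, 2))
    shared = [bool(a["coauthors"] & b["coauthors"]) for a, b in pairs]
    all_coauthors = set().union(*(r["coauthors"] for r in cluster))
    return {
        # Quantified features: "every pair shares a co-author", etc.
        "all_pairs_share_coauthor": float(all(shared)) if pairs else 0.0,
        "frac_pairs_share_coauthor": sum(shared) / len(pairs) if pairs else 0.0,
        # Aggregate features that no single pair can express.
        "num_records": float(len(cluster)),
        "num_distinct_coauthors": float(len(all_coauthors)),
    }

def score(weights, feats):
    """Linear score of a candidate cluster under the current weights."""
    return sum(weights.get(k, 0.0) * v for k, v in feats.items())

def ranking_update(weights, good, bad, lr=1.0):
    """Perceptron-style ranking step (an assumed stand-in for the paper's loss).

    Updates only when the model makes an error, i.e. when the incorrect
    candidate `bad` scores at least as high as the correct candidate `good`.
    Returns True if the weights were changed.
    """
    fg, fb = set_features(good), set_features(bad)
    if score(weights, fb) >= score(weights, fg):
        for k in set(fg) | set(fb):
            weights[k] = weights.get(k, 0.0) + lr * (fg.get(k, 0.0) - fb.get(k, 0.0))
        return True
    return False
```

In this sketch a training example exists only when the current weights rank a wrong cluster at or above a right one, which is one simple reading of "error-driven" and "rank-based" as the abstract uses the terms.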
Publication Date
2007
Comments
This is the pre-published version harvested from CIIR.
Citation Information
Aron Culotta, Pallika Kanani, Robert Hall, Michael Wick, et al. "Author Disambiguation using Error-driven Machine Learning with a Ranking Loss Function" (2007)
Available at: http://works.bepress.com/andrew_mccallum/105/