"Canonicalization of Database Records using Adaptive Similarity Measures" by Aron Culotta

Selected Works of Andrew McCallum

Unpublished Paper

Canonicalization of Database Records using Adaptive Similarity Measures

(2007)

Aron Culotta
Michael Wick
Robert Hall
Matthew Marzilli
Andrew McCallum, University of Massachusetts - Amherst

Abstract

It is becoming increasingly common to construct databases from information automatically culled from many heterogeneous sources. For example, a research publication database can be constructed by automatically extracting titles, authors, and conference information from papers and their references. A common difficulty in consolidating data from multiple sources is that records are referenced in a variety of ways (e.g., abbreviations, aliases, and misspellings). Therefore, it can be difficult to construct a single, standard representation to present to the user. We refer to the task of constructing this representation as canonicalization. Despite its importance, there is very little existing work on canonicalization. In this paper, we explore the use of edit distance measures to construct a canonical representation that is "central" in the sense that it is most similar to each of the disparate records. This approach reduces the impact of noisy records on the canonical representation. Furthermore, because the user may prefer different styles of canonicalization, we show how different edit distance costs can result in different forms of canonicalization. For example, reducing the cost of character deletions can result in representations that favor abbreviated forms over expanded forms (e.g., KDD versus Conference on Knowledge Discovery and Data Mining). We describe how to learn these costs from a small amount of manually annotated data using stochastic hill-climbing. Additionally, we investigate feature-based methods to learn ranking preferences over canonicalizations. We empirically evaluate our approach on a real-world publications database and show that our learning method results in a canonicalization solution that is robust to errors and easily customizable to user preferences.

Keywords

Information Systems, Database Management, Database Applications, Data Mining, Information Extraction, Data Cleaning

Disciplines

Computer Sciences

Publication Date

2007

Comments

This is the pre-published version harvested from CIIR.

Citation Information

Aron Culotta, Michael Wick, Robert Hall, Matthew Marzilli, et al. "Canonicalization of Database Records using Adaptive Similarity Measures" (2007)
Available at: http://works.bepress.com/andrew_mccallum/104/