"Generalized Component Analysis for Text with Heterogeneous Attributes" by Xuerui Wang

Selected Works of Andrew McCallum

Unpublished Paper

Generalized Component Analysis for Text with Heterogeneous Attributes

(2007)

Xuerui Wang
Chris Pal
Andrew McCallum, University of Massachusetts - Amherst

Abstract

We present a class of richly structured, undirected hidden variable models suitable for simultaneously modeling text along with other attributes encoded in different modalities. Our model generalizes techniques such as Principal Component Analysis to heterogeneous data types. In contrast to other approaches, this framework allows modalities such as words, authors and timestamps to be captured in their natural, probabilistic encodings. We demonstrate the effectiveness of our framework on two tasks: author prediction over 13 years of the NIPS conference proceedings, and recipient prediction over a 10-month academic email archive belonging to a single researcher. Our approach should also apply more broadly to real-world applications in which one wishes to make predictions efficiently over a large number of potential outputs, such as targeted advertising.
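
For orientation, the classical technique that the abstract says the model generalizes, Principal Component Analysis, can be sketched as below. This is ordinary PCA computed by singular value decomposition on a toy document-term count matrix; it is not the authors' undirected hidden variable model, and the function name `pca`, the matrix `counts`, and the choice of two components are illustrative assumptions rather than details from the paper.

```python
import numpy as np

def pca(X, k):
    """Standard PCA via SVD: project the rows of X onto the top-k components."""
    # Center each column (feature) so the components capture variance, not the mean.
    Xc = X - X.mean(axis=0)
    # Rows of Vt are orthonormal directions ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]            # shape (k, n_features)
    scores = Xc @ components.T     # shape (n_samples, k)
    return scores, components

# Hypothetical toy data: 4 documents (rows) by 4 vocabulary terms (columns).
counts = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [3.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 2.0, 3.0],
    [0.0, 1.0, 1.0, 2.0],
])

scores, components = pca(counts, k=2)
```

Plain PCA treats every column as a real-valued feature under one squared-error objective, which is the limitation the abstract points to when it speaks of keeping words, authors and timestamps in their natural, probabilistic encodings.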

Keywords

Data Mining,
Database Applications,
Database Management,
Artificial Intelligence,
Undirected Graphical Models,
Topic Modeling,
Text Mining,
Author Prediction,
Recipient Prediction,
Multimodal Heterogeneous Data

Disciplines

Computer Sciences

Publication Date

2007

Comments

This is the pre-published version harvested from CIIR.

Citation Information

Xuerui Wang, Chris Pal and Andrew McCallum. "Generalized Component Analysis for Text with Heterogeneous Attributes" (2007)
Available at: http://works.bepress.com/andrew_mccallum/101/