Unpublished Paper
Topical N-grams: Phrase and Topic Discovery, with an Application to Information Retrieval
(2007)
  • Xuerui Wang
  • Andrew McCallum, University of Massachusetts - Amherst
  • Xing Wei
Abstract
Most topic models, such as latent Dirichlet allocation, rely on the bag-of-words assumption. However, word order and phrases are often critical to capturing the meaning of text. This paper presents Topical N-grams, a topic model that discovers topics as well as the individual words and phrases that define their meaning. The probabilistic model generates words in their textual order by, for each word, first sampling a topic, then sampling its status as a unigram or bigram, then sampling the word from a topic-specific unigram or bigram distribution. Thus our model can represent that the phrase "white house" has special meaning in the 'politics' topic, but not the 'real estate' topic. Successive bigrams form longer phrases. We present experimental results showing more interpretable topics from NIPS data, and improved information retrieval performance on a significant TREC collection.
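The generative process described in the abstract (sample a topic, sample unigram/bigram status, then sample the word) can be sketched in miniature. The code below is an illustrative toy, not the paper's implementation: the vocabulary, the fixed distributions `theta`, `phi`, `sigma`, and `psi`, and the rule that a bigram continuation inherits the previous word's topic are all simplifying assumptions made here for clarity (the full model places Dirichlet priors over these distributions and infers them with Gibbs sampling).

```python
import random

random.seed(0)

# Hypothetical toy setup: 2 topics and a tiny vocabulary. All
# distributions are hand-set for illustration, not learned.
vocab = ["white", "house", "price", "vote", "market", "senate"]
n_topics = 2

# theta: per-document topic distribution (in the model, drawn from a Dirichlet).
theta = {0: 0.5, 1: 0.5}

# phi[z]: topic-specific unigram distribution over words.
phi = [
    {"white": 0.2, "house": 0.3, "vote": 0.3, "senate": 0.2},   # 'politics'
    {"white": 0.1, "house": 0.3, "price": 0.3, "market": 0.3},  # 'real estate'
]

# sigma[(z, prev)]: topic-specific bigram distribution given the previous word.
# "white house" is a phrase in the 'politics' topic but not in 'real estate'.
sigma = {(0, "white"): {"house": 1.0}}

# psi[(z, prev)]: probability that the current word continues a bigram,
# conditioned on the previous word's topic and identity.
psi = {(0, "white"): 0.9}

def draw(dist):
    """Sample a key from a {key: probability} distribution."""
    r, acc = random.random(), 0.0
    for k, p in dist.items():
        acc += p
        if r < acc:
            return k
    return k  # guard against floating-point underflow

def generate(n_words):
    """Generate words in textual order, per the three sampling steps."""
    words, prev_w, prev_z = [], None, None
    for _ in range(n_words):
        z = draw(theta)                                  # 1. sample a topic
        bigram = (prev_w is not None and                 # 2. sample bigram status
                  random.random() < psi.get((prev_z, prev_w), 0.1))
        if bigram and (prev_z, prev_w) in sigma:         # 3. sample the word
            z = prev_z  # assumption: phrase words share the first word's topic
            w = draw(sigma[(z, prev_w)])
        else:
            w = draw(phi[z])
        words.append(w)
        prev_w, prev_z = w, z
    return words

print(" ".join(generate(10)))
```

Because successive bigram indicators chain, a run of continuations yields phrases longer than two words, matching the abstract's note that successive bigrams form longer phrases.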
Publication Date
2007
Comments
This is the pre-published version harvested from CIIR.
Citation Information
Xuerui Wang, Andrew McCallum and Xing Wei. "Topical N-grams: Phrase and Topic Discovery, with an Application to Information Retrieval" (2007)
Available at: http://works.bepress.com/andrew_mccallum/120/