Unpublished Paper
Topical N-grams: Phrase and Topic Discovery, with an Application to Information Retrieval
(2007)
  • Xuerui Wang
  • Andrew McCallum, University of Massachusetts - Amherst
  • Xing Wei
Abstract
Most topic models, such as latent Dirichlet allocation, rely on the bag-of-words assumption. However, word order and phrases are often critical to capturing the meaning of text. This paper presents Topical N-grams, a topic model that discovers topics as well as the individual words and phrases that define their meaning. The probabilistic model generates words in their textual order by, for each word, first sampling a topic, then sampling its status as a unigram or bigram, then sampling the word from a topic-specific unigram or bigram distribution. Thus our model can represent that the phrase "white house" has special meaning in the 'politics' topic, but not the 'real estate' topic. Successive bigrams form longer phrases. We present experimental results showing more interpretable topics from NIPS data, and improved information retrieval performance on a significant TREC collection.
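The generative process described in the abstract (sample a topic, sample unigram/bigram status, then sample the word) can be sketched in miniature. The code below is an illustrative toy, not the paper's implementation: the vocabulary, the fixed distributions `theta`, `phi`, `sigma`, and `psi`, and the rule that a bigram continuation inherits the previous word's topic are all simplifying assumptions made here for clarity (the full model places Dirichlet priors over these distributions and infers them with Gibbs sampling).

```python
import random

random.seed(0)

# Hypothetical toy setup: 2 topics and a tiny vocabulary. All
# distributions are hand-set for illustration, not learned.
vocab = ["white", "house", "price", "vote", "market", "senate"]
n_topics = 2

# theta: per-document topic distribution (in the model, drawn from a Dirichlet).
theta = {0: 0.5, 1: 0.5}

# phi[z]: topic-specific unigram distribution over words.
phi = [
    {"white": 0.2, "house": 0.3, "vote": 0.3, "senate": 0.2},   # 'politics'
    {"white": 0.1, "house": 0.3, "price": 0.3, "market": 0.3},  # 'real estate'
]

# sigma[(z, prev)]: topic-specific bigram distribution given the previous word.
# "white house" is a phrase in the 'politics' topic but not in 'real estate'.
sigma = {(0, "white"): {"house": 1.0}}

# psi[(z, prev)]: probability that the current word continues a bigram,
# conditioned on the previous word's topic and identity.
psi = {(0, "white"): 0.9}

def draw(dist):
    """Sample a key from a {key: probability} distribution."""
    r, acc = random.random(), 0.0
    for k, p in dist.items():
        acc += p
        if r < acc:
            return k
    return k  # guard against floating-point underflow

def generate(n_words):
    """Generate words in textual order, per the three sampling steps."""
    words, prev_w, prev_z = [], None, None
    for _ in range(n_words):
        z = draw(theta)                                  # 1. sample a topic
        bigram = (prev_w is not None and                 # 2. sample bigram status
                  random.random() < psi.get((prev_z, prev_w), 0.1))
        if bigram and (prev_z, prev_w) in sigma:         # 3. sample the word
            z = prev_z  # assumption: phrase words share the first word's topic
            w = draw(sigma[(z, prev_w)])
        else:
            w = draw(phi[z])
        words.append(w)
        prev_w, prev_z = w, z
    return words

print(" ".join(generate(10)))
```

Because successive bigram indicators chain, a run of continuations yields phrases longer than two words, matching the abstract's note that successive bigrams form longer phrases.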
Publication Date
2007
Comments
This is the pre-published version harvested from CIIR.
Citation Information
Xuerui Wang, Andrew McCallum and Xing Wei. "Topical N-grams: Phrase and Topic Discovery, with an Application to Information Retrieval" (2007)
Available at: http://works.bepress.com/andrew_mccallum/120/