Unpublished Paper
A Note on Topical N-grams
(2005)
  • Xuerui Wang
  • Andrew McCallum, University of Massachusetts - Amherst
Abstract
Most popular topic models (such as Latent Dirichlet Allocation) rest on an underlying assumption: bag of words. However, text is a sequence of discrete word tokens, and without considering word order (in other words, the nearby context in which a word occurs), the meaning of language cannot be accurately captured by word co-occurrences alone. In this sense, collocations of words (phrases) have to be considered. However, like individual words, phrases sometimes exhibit polysemy as well, depending on the context. More noticeably, a composition of two (or more) words is a phrase in some contexts but not in others. In this paper, we propose a new probabilistic generative model that automatically determines unigram words and phrases based on context and simultaneously associates them with a mixture of topics, and we show very interesting results on large text corpora.
Publication Date
2005
Comments
This is the pre-published version harvested from CIIR.
Citation Information
Xuerui Wang and Andrew McCallum. "A Note on Topical N-grams" (2005)
Available at: http://works.bepress.com/andrew_mccallum/127/