Presentation
Improving the Character Ngram Model for the DSL Task with BM25 Weighting and Less Frequently Used Feature Sets
Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects, pages 115–123 (2017)
  • Yves Bestgen
Abstract
This paper describes the system developed by the Centre for English Corpus Linguistics (CECL) to discriminate between similar languages, language varieties and dialects. Based on an SVM with character and POStag n-grams as features and the BM25 weighting scheme, it achieved 92.7% accuracy in the Discriminating between Similar Languages (DSL) task, ranking first among eleven systems but with a lead over the next three teams of only 0.2%. A simpler version of the system ranked second in the German Dialect Identification (GDI) task thanks to several ad hoc postprocessing steps. Complementary analyses carried out by a cross-validation procedure suggest that the BM25 weighting scheme could be competitive in this type of task, at least in comparison with sublinear TF-IDF. POStag n-grams also improved the system performance.
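
To make the contrast between the two weighting schemes named in the abstract concrete, here is a minimal sketch of BM25 and sublinear TF-IDF weights computed over character n-grams. It is an illustration only, not the system described in the paper: the function names, the n-gram order, the parameter values k1 = 1.2 and b = 0.75 (common defaults), and the particular IDF variants are all assumptions, since the abstract does not state them.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Return the list of overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def bm25_weights(docs, n=3, k1=1.2, b=0.75):
    """BM25 weight for every character n-gram of every document.

    k1 and b are common default values, assumed here for illustration.
    `docs` must be a non-empty list of strings.
    """
    counts = [Counter(char_ngrams(d, n)) for d in docs]
    lengths = [sum(c.values()) for c in counts]
    avg_len = sum(lengths) / len(lengths)
    df = Counter(g for c in counts for g in c)  # document frequency
    n_docs = len(docs)
    weighted = []
    for c, dl in zip(counts, lengths):
        w = {}
        for g, tf in c.items():
            # One common non-negative IDF variant (an assumption).
            idf = math.log((n_docs - df[g] + 0.5) / (df[g] + 0.5) + 1.0)
            # Term-frequency part saturates as tf grows and is
            # normalised by document length relative to the average.
            norm = k1 * (1.0 - b + b * dl / avg_len)
            w[g] = idf * tf * (k1 + 1.0) / (tf + norm)
        weighted.append(w)
    return weighted


def sublinear_tfidf_weights(docs, n=3):
    """Sublinear TF-IDF in one common form: (1 + log tf) * log(N / df)."""
    counts = [Counter(char_ngrams(d, n)) for d in docs]
    df = Counter(g for c in counts for g in c)
    n_docs = len(docs)
    return [
        {g: (1.0 + math.log(tf)) * math.log(n_docs / df[g])
         for g, tf in c.items()}
        for c in counts
    ]


if __name__ == "__main__":
    sample = ["aaaa", "abab"]  # toy documents, not DSL data
    print(bm25_weights(sample, n=2))
    print(sublinear_tfidf_weights(sample, n=2))
```

In both schemes a repeated n-gram gains less weight with each additional occurrence. BM25 additionally bounds the term-frequency contribution (it cannot exceed k1 + 1 times the IDF) and adjusts for document length. The resulting weight dictionaries could then be turned into feature vectors for a linear classifier such as an SVM.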
Publication Date
April 3, 2017
Location
Valencia, Spain
Citation Information
Yves Bestgen. "Improving the Character Ngram Model for the DSL Task with BM25 Weighting and Less Frequently Used Feature Sets" Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects, pages 115–123 (2017)
Available at: http://works.bepress.com/yvesbestgen/12/