Presentation
Combining Shallow and Linguistically Motivated Features in Native Language Identification
Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications (2013)
  • Serhiy Bykh, Universität Tübingen
  • Sowmya Vajjala, Universität Tübingen
  • Julia Krivanek, Universität Tübingen
  • Detmar Meurers, Universität Tübingen
Abstract
We explore a range of features and ensembles for the task of Native Language Identification (NLI) as part of the NLI Shared Task (Tetreault et al., 2013). Starting with recurring word-based n-grams (Bykh and Meurers, 2012), we tested different linguistic abstractions such as part-of-speech, dependencies, and syntactic trees as features for NLI. We also experimented with features encoding morphological properties, the nature of the realizations of particular lemmas, and several measures of complexity developed for proficiency and readability classification (Vajjala and Meurers, 2012). Employing an ensemble classifier incorporating all of our features, we achieved an accuracy of 82.2% (rank 5) in the closed task and 83.5% (rank 1) in the open-2 task. In the open-1 task, the word-based recurring n-grams outperformed the ensemble, yielding 38.5% (rank 2). Overall, across all three tasks, our best accuracy of 83.5% for the standard TOEFL11 test set came in second place.
Publication Date
June, 2013
Location
Atlanta, GA
Comments
Copyright 2013 The Authors
Citation Information
Serhiy Bykh, Sowmya Vajjala, Julia Krivanek and Detmar Meurers. "Combining Shallow and Linguistically Motivated Features in Native Language Identification" Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications (2013)
Available at: http://works.bepress.com/sowmya-vajjala/11/