Large Scale, Multi-domain Language Identification
Synthesis Lectures on Human Language Technologies
  • Tommi Jauhiainen, Helsingin Yliopisto
  • Marcos Zampieri, George Mason University
  • Timothy Baldwin, Mohamed Bin Zayed University of Artificial Intelligence
  • Krister Lindén, Helsingin Yliopisto
Document Type
Article
Abstract

In general, the more languages a language identifier has to distinguish, the more difficult it is to identify the language of a given text (Brown 2012; Rodrigues 2012; Jauhiainen et al. 2017a). Intuitively, adding classes makes classification harder. However, this depends in part on the evaluation measure used. For example, accuracy averaged over all languages may improve when easily distinguishable languages are added to the language set. Brown (2014) presents results in which the average accuracy is higher for 1366 languages than for a subset of 781 languages. He attributes this to the fact that a larger proportion of the languages in the smaller set are based on Wikipedia texts, which are often multilingual and contain a lot of text in unintended languages. Most language identification research has focused on a relatively small number of languages. In Table 5.1, we list references that have empirically tested language identifiers with 100 or more languages.
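
The dependence on the evaluation measure can be illustrated with a small sketch. The per-language accuracies below are hypothetical values chosen for illustration, not results from Brown (2014) or any other cited work, and the helper name `macro_average_accuracy` is likewise an assumption. The sketch assumes that the unweighted mean of per-language accuracies is the measure of interest. Adding three easily distinguishable languages slightly lowers the accuracy of every original language, yet the average over all languages rises.

```python
from statistics import mean


def macro_average_accuracy(per_language_accuracy):
    """Return the unweighted mean of the per-language accuracies."""
    return mean(per_language_accuracy.values())


# Hypothetical accuracies for three closely related, easily confused languages.
small_set = {"bos": 0.70, "hrv": 0.72, "srp": 0.74}

# The same languages after three easily distinguishable languages are added.
# Assumption: each original language becomes slightly harder to identify.
large_set = {
    "bos": 0.69, "hrv": 0.71, "srp": 0.73,
    "fin": 0.99, "jpn": 0.99, "kat": 0.99,
}

small_avg = macro_average_accuracy(small_set)
large_avg = macro_average_accuracy(large_set)

print(f"{len(small_set)} languages: average accuracy {small_avg:.2f}")
print(f"{len(large_set)} languages: average accuracy {large_avg:.2f}")
```

In this constructed case the task has become harder for every language in the original set, but the averaged figure improves because the added languages are easy to separate from the rest.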

DOI
10.1007/978-3-031-45822-4_5
Publication Date
1-2-2024
Keywords
  • Artificial intelligence
  • Domain language
  • Evaluation measures
  • Language identification
  • Large-scales
  • Multi-domains
  • Recognizable languages
  • Wikipedia
Comments

IR conditions: non-described

Citation Information
Tommi Jauhiainen, Marcos Zampieri, Timothy Baldwin and Krister Lindén. "Large Scale, Multi-domain Language Identification" Synthesis Lectures on Human Language Technologies Vol. Part F2039 (2024) p. 117 - 135 ISSN: 19474040
Available at: http://works.bepress.com/timothy-baldwin/30/