"Large Scale, Multi-domain Language Identification" by Tommi Jauhiainen

Selected Works of Timothy Baldwin

Large Scale, Multi-domain Language Identification

Synthesis Lectures on Human Language Technologies

Tommi Jauhiainen, Helsingin Yliopisto
Marcos Zampieri, George Mason University
Timothy Baldwin, Mohamed Bin Zayed University of Artificial Intelligence
Krister Lindén, Helsingin Yliopisto

Document Type

Article

Abstract

In general, the more recognizable languages there are, the more difficult it is to recognize the language (Brown 2012; Rodrigues 2012; Jauhiainen et al. 2017a). It is intuitively easy to understand that if classes are added, the classification becomes more difficult. However, this depends in part on the evaluation measures used. For example, if the average accuracy of all languages is measured, it may improve when easily distinguishable languages are added to the language selection. Brown (2014) presents results where the average accuracy is higher for 1366 languages than for a subset of 781 languages. He explains this phenomenon by the fact that a larger proportion of languages in a smaller repertoire are based on Wikipedia texts, which are often multilingual, containing lots of texts in unintended languages. Most language identification research has focused on a relatively small number of languages. In Table 5.1, we have listed references that have empirically tested language identifiers with 100 or more languages.
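The point about evaluation measures can be made concrete with a small sketch. It assumes that "average accuracy of all languages" means the unweighted (macro) mean of per-language accuracies. The language names and accuracy values below are hypothetical, chosen only for illustration, and are not taken from Brown (2014) or any other cited work.

```python
from statistics import mean


def macro_accuracy(per_language_accuracy):
    """Return the unweighted mean of per-language accuracies."""
    return mean(per_language_accuracy.values())


# Hypothetical per-language accuracies for a small language repertoire.
small_repertoire = {"lang_a": 0.80, "lang_b": 0.85, "lang_c": 0.90}

# Hypothetical larger repertoire: the original languages each lose a little
# accuracy because there are more classes, but two easily distinguishable
# languages (lang_d, lang_e) are added.
large_repertoire = {
    "lang_a": 0.79,
    "lang_b": 0.84,
    "lang_c": 0.89,
    "lang_d": 0.99,
    "lang_e": 0.99,
}

small_avg = macro_accuracy(small_repertoire)
large_avg = macro_accuracy(large_repertoire)

print(f"small repertoire: {small_avg:.3f}")
print(f"large repertoire: {large_avg:.3f}")
```

In this sketch every original language is identified slightly less accurately in the larger repertoire, yet the macro average rises, because the added languages are easy to distinguish.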

DOI

10.1007/978-3-031-45822-4_5

Publication Date

1-2-2024

Keywords

Artificial intelligence,
Domain language,
Evaluation measures,
Language identification,
Large-scale,
Multi-domain,
Recognizable languages,
Wikipedia

Additional Links

DOI link: https://doi.org/10.1007/978-3-031-45822-4_5

Citation Information

Tommi Jauhiainen, Marcos Zampieri, Timothy Baldwin, and Krister Lindén. "Large Scale, Multi-domain Language Identification." Synthesis Lectures on Human Language Technologies, Vol. Part F2039 (2024), pp. 117-135. ISSN: 1947-4040
Available at: http://works.bepress.com/timothy-baldwin/30/