Article
Specific Challenges of Variation and Text Types
Synthesis Lectures on Human Language Technologies
  • Tommi Jauhiainen, Helsingin Yliopisto
  • Marcos Zampieri, George Mason University
  • Timothy Baldwin, Mohamed Bin Zayed University of Artificial Intelligence
  • Krister Lindén, Helsingin Yliopisto
Document Type
Article
Abstract

One aspect of language identification that makes it both fascinating and difficult is the similarity between languages. Some languages are extremely easy to distinguish from each other, whereas others are extremely difficult to tell apart. This phenomenon is closely tied to the definition of “language”, which is much less trivial than one might think. It is hard to draw the line between languages and dialects. For example, mutual intelligibility is one of the criteria often mentioned, but it is highly subjective and very difficult to measure objectively. Several organizations maintain lists of languages. Ethnologue: Languages of the World is currently in its 25th edition and lists 7,168 known living languages. It is published by SIL International, which is also responsible for the ISO 639-3 standard of three-letter codes representing individual languages. The Library of Congress is the registration authority for the ISO 639-2 standard, which provides ISO 639-3 compatible three-letter codes for a considerably smaller number of languages and is likewise still updated continuously. Glottolog, published by the Max Planck Institute, lists 8,572 entries in its version 4.7. Linguasphere Register volume two includes over 30,000 languages and dialects. Of these lists, ISO 639-3 and its subset ISO 639-2 are the most widely used, even though the two-letter codes from ISO 639-1 are still in use on many occasions.

DOI
10.1007/978-3-031-45822-4_4
Publication Date
1-2-2024
Keywords
  • Language identification
  • Library of Congress
  • Max Planck Institute
  • Registration Authority
Comments

IR conditions: non-described

Citation Information
Tommi Jauhiainen, Marcos Zampieri, Timothy Baldwin and Krister Lindén. "Specific Challenges of Variation and Text Types" Synthesis Lectures on Human Language Technologies Vol. Part F2039 (2024) p. 99-115 ISSN: 1947-4040
Available at: http://works.bepress.com/timothy-baldwin/27/