Article
One Country, 700+ Languages: NLP Challenges for Underrepresented Languages and Dialects in Indonesia
arXiv
  • Alham Fikri Aji, Amazon, United States
  • Genta Indra Winata, Bloomberg
  • Fajri Koto, The University of Melbourne, Australia
  • Samuel Cahyawijaya, HKUST, Hong Kong
  • Ade Romadhony, Telkom University, Indonesia & INACL, Indonesia
  • Rahmad Mahendra, Universitas Indonesia, Indonesia & INACL, Indonesia
  • Kemal Kurniawan, The University of Melbourne, Australia & INACL, Indonesia
  • David Moeljadi, Kanda University of International Studies, Japan
  • Radityo Eko Prasojo, Kata.ai, Indonesia
  • Timothy Baldwin, The University of Melbourne, Australia & Mohamed Bin Zayed University of Artificial Intelligence
  • Jey Han Lau, The University of Melbourne, Australia
  • Sebastian Ruder, Google Research, United States
Document Type
Article
Abstract

NLP research is impeded by a lack of resources and awareness of the challenges presented by underrepresented languages and dialects. Focusing on the languages spoken in Indonesia, the second most linguistically diverse and the fourth most populous nation of the world, we provide an overview of the current state of NLP research for Indonesia’s 700+ languages. We highlight challenges in Indonesian NLP and how these affect the performance of current NLP systems. Finally, we provide general recommendations to help develop NLP technology not only for languages of Indonesia but also other underrepresented languages.

Copyright © 2022, The Authors. All rights reserved.

DOI
doi.org/10.48550/arXiv.2203.13357
Publication Date
3-24-2022
Keywords
  • NLP systems
  • Computation and Language (cs.CL)
Comments

Preprint: arXiv

Archived with thanks to arXiv

Citation Information
A. F. Aji et al., "One country, 700+ languages: NLP challenges for underrepresented languages and dialects in Indonesia", 2022, arXiv:2203.13357v1