Presentation
Data Adaptive Text Extraction Techniques for Individualized Big Data Curation and the Generation of Machine Learning Models for Buddhist Canon Research
Tripitaka for the Future: Envisioning the Buddhist Canon in the Digital Age, University of Arizona
(2018)
Abstract
A variety of barriers keep researchers in religious studies and the humanities from using advances in machine learning to investigate questions in their fields. One of the largest is the lack of large-scale datasets curated by and for specialists. A second is the lack of machine learning models built specifically to investigate religious and humanistic questions. This presentation describes attempts to help researchers overcome these barriers. Specifically, it details efforts by researchers investigating the Buddhist canon to curate individualized, large-scale datasets using new software called Mo文oN. These curatorial efforts support the goal of building machine learning models that can facilitate the digital encoding of every extant version of the Buddhist canon. Several methods are described for adaptively extracting text from images of the Qisha canon using Mo文oN to create a large dataset of tagged character images. The presentation concludes with a description of how these individualized, large-scale datasets can be used to generate machine learning models that facilitate the automated transcription of the entire Qisha canon, as well as other canon materials.
Publication Date
November 2018
Location
Tucson, AZ
Citation Information
Wayne de Fremery. "Data Adaptive Text Extraction Techniques for Individualized Big Data Curation and the Generation of Machine Learning Models for Buddhist Canon Research." Tripitaka for the Future: Envisioning the Buddhist Canon in the Digital Age, University of Arizona (2018). Available at: http://works.bepress.com/wayne-defremery/47/