Abstract
While widespread languages remain prevalent in digital media, endangered languages, such as Indigenous Australian languages, often have scarce textual resources and lack a substantial digital presence. This diminishes the survivability of their linguistic and cultural heritage, making language preservation initiatives important. Neural Machine Translation (NMT) efforts for low-resource languages can help accelerate the digitisation and preservation of such languages by enabling translation, information access, and second-language acquisition. This study explores the challenges and considerations of low-resource machine translation (MT), focusing specifically on primarily oral Indigenous Australian languages with minimal available textual resources. We also examine the challenges of sourcing quality data and the ethical considerations of approaching Indigenous Cultural and Intellectual Property (ICIP) in research. As NMT performance typically scales with the quality and quantity of multilingual corpora, we investigate promising alternatives, such as leveraging large language models (LLMs) to tackle severely low-resource MT as a few-shot prompting translation task. By employing a data imputation approach inspired by Continuous Bag-of-Words (CBOW) to strengthen a prompt's contextual relevance, we enhance the translations generated by LLMs, achieving a chrF score of 37.3 on imputed data against a baseline of 31.6 with GPT-3.5, and 39.3 against a baseline of 38.3 with GPT-4. Through this work, we hope to establish a foundation for future efforts to preserve Indigenous Australian languages.
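As a rough illustration of the approach summarised above, the sketch below shows how few-shot examples for an LLM translation prompt might be selected by averaging word vectors over the source context, in the spirit of CBOW. This is a minimal sketch under stated assumptions, not the paper's actual pipeline: the vocabulary, vectors, function names, and placeholder target strings are all hypothetical.

```python
# Illustrative sketch (not the authors' exact method): choose few-shot
# examples for an LLM translation prompt by CBOW-style context averaging.
import numpy as np

# Hypothetical toy word vectors; in practice these would be learned from
# whatever bilingual or monolingual data is available.
VOCAB = {"river": [1.0, 0.0], "water": [0.9, 0.1],
         "fire": [0.0, 1.0], "smoke": [0.1, 0.9]}

def cbow_vector(sentence: str) -> np.ndarray:
    """Average the vectors of known words (CBOW-style context encoding)."""
    vecs = [np.array(VOCAB[w]) for w in sentence.lower().split() if w in VOCAB]
    return np.mean(vecs, axis=0) if vecs else np.zeros(2)

def select_examples(source: str, pool: list[tuple[str, str]], k: int = 2):
    """Rank candidate (source, target) pairs by cosine similarity to the input."""
    q = cbow_vector(source)
    def sim(pair: tuple[str, str]) -> float:
        v = cbow_vector(pair[0])
        denom = np.linalg.norm(q) * np.linalg.norm(v)
        return float(q @ v / denom) if denom else 0.0
    return sorted(pool, key=sim, reverse=True)[:k]

def build_prompt(source: str, examples: list[tuple[str, str]]) -> str:
    """Assemble a few-shot translation prompt for an LLM."""
    shots = "\n".join(f"English: {s}\nTranslation: {t}" for s, t in examples)
    return f"{shots}\nEnglish: {source}\nTranslation:"

# Placeholder target strings stand in for the Indigenous-language side,
# which is deliberately not fabricated here.
pool = [("the river is wide", "..."), ("smoke from the fire", "...")]
src = "water in the river"
print(build_prompt(src, select_examples(src, pool)))
```

The design intuition is that examples whose averaged context vectors lie closest to the input sentence supply the most relevant in-context demonstrations, which is one way to strengthen a prompt's contextual relevance when parallel data is scarce.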