Plains Cree (nêhiyawêwin) is an Indigenous language that is spoken in Canada and the USA. It is the most widely spoken dialect of Cree and a morphologically complex language that is polysynthetic, highly inflective, and agglutinative. It is an extremely low resource language, with no existing corpus that is both available and prepared for supporting the development of language technologies. To support nêhiyawêwin revitalization and preservation, we developed a corpus covering diverse genres, time periods, and texts for a variety of intended audiences. The data has been verified and cleaned; it is ready for use in developing language technologies for nêhiyawêwin. The corpus includes the corresponding English phrases or audio files where available. We demonstrate the utility of the corpus through its community use and its use to build language technologies that can provide the types of support that community members have expressed are desirable. The corpus is available for public use.
A common mistake made by language learners is the misguided usage of first language rules when communicating in another language. In this paper, n-gram and recurrent neural network language models are used to represent language structures and detect when Chinese native speakers incorrectly transfer rules from their first language (i.e., Chinese) into their English writing. These models make it possible to inform corrective error feedback with error causes, such as negative language transfer. We report the results of our negative language detection experiments with n-gram and recurrent neural network models that were trained using part-of-speech tags. The best performing model achieves an F1-score of 0.51 when tasked with recognizing negative language transfer in English learner data.
Automatic personalized corrective feedback can help language learners from different backgrounds better acquire a new language. This paper introduces a learner English dataset in which learner errors are accompanied by information about possible error sources. This dataset contains manually annotated error causes for learner writing errors. These causes tie learner mistakes to structures from their first languages, when the rules in English and in the first language diverge. This new dataset will enable second language acquisition researchers to computationally analyze a large quantity of learner errors that are related to language transfer from the learners’ first language. The dataset can also be applied in personalizing grammatical error correction systems according to the learners’ first language and in providing feedback that is informed by the cause of an error.