Justin DeBenedetto
2023
Byte-ranked Curriculum Learning for BabyLM Strict-small Shared Task 2023
Justin DeBenedetto
Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning
Introducing Rhetorical Parallelism Detection: A New Task with Datasets, Metrics, and Baselines
Stephen Bothwell | Justin DeBenedetto | Theresa Crnkovich | Hildegund Müller | David Chiang
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Rhetoric, both spoken and written, involves not only content but also style. One common stylistic tool is parallelism: the juxtaposition of phrases which have the same sequence of linguistic (e.g., phonological, syntactic, semantic) features. Despite the ubiquity of parallelism, the field of natural language processing has seldom investigated it, missing a chance to better understand the nature of the structure, meaning, and intent that humans convey. To address this, we introduce the task of rhetorical parallelism detection. We construct a formal definition of it; we provide one new Latin dataset and one adapted Chinese dataset for it; we establish a family of metrics to evaluate performance on it; and, lastly, we create baseline systems and novel sequence labeling schemes to capture it. On our strictest metric, we attain F1 scores of 0.40 and 0.43 on our Latin and Chinese datasets, respectively.
2018
Part-of-Speech Tagging on an Endangered Language: a Parallel Griko-Italian Resource
Antonios Anastasopoulos | Marika Lekakou | Josep Quer | Eleni Zimianiti | Justin DeBenedetto | David Chiang
Proceedings of the 27th International Conference on Computational Linguistics
Most work on part-of-speech (POS) tagging is focused on high-resource languages, or examines low-resource and active learning settings through simulated studies. We evaluate POS tagging techniques on an actual endangered language, Griko. We present a resource that contains 114 narratives in Griko, along with sentence-level translations in Italian, and provides gold annotations for the test set. Based on a previously collected small corpus, we investigate several traditional methods, as well as methods that take advantage of monolingual data or project cross-lingual POS tags. We show that the combination of a semi-supervised method with cross-lingual transfer is more appropriate for this extremely challenging setting, with the best tagger achieving an accuracy of 72.9%. With an applied active learning scheme, which we use to collect sentence-level annotations over the test set, we achieve improvements of more than 21 percentage points.