Yohei Igarashi


2021

Varieties of Plain Language
Allen Riddell | Yohei Igarashi
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

Many organizations seek or need to produce documents that are written plainly. In the United States, the “Plain Writing Act of 2010” requires that many federal agencies’ documents for the public be written in plain English. In particular, the government’s Plain Language Action and Information Network (“PLAIN”) recommends that writers use short sentences and everyday words, as does the Securities and Exchange Commission’s “Plain English Rule.” Since the 1970s, American plain language advocates have moved away from readability measures and favored usability testing and document design considerations. But in this paper we use quantitative measures of sentence length and word difficulty that (1) reveal stylistic variation among PLAIN’s exemplars of plain writing, and (2) help us position PLAIN’s exemplars relative to documents written in other kinds of accessible English (e.g., The New York Times, Voice of America Special English, and Wikipedia) and one academic document likely to be perceived as difficult. Sentence and vocabulary measures that are left uncombined, unlike in traditional readability formulas, can complement usability testing and document design considerations and advance knowledge about different types of plainer English.
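
To illustrate what keeping sentence and vocabulary measures separate could look like, here is a minimal sketch. It is not the authors' implementation: the sentence splitter is naive, and `COMMON_WORDS` is a tiny hypothetical stand-in for whatever everyday-word list a real analysis would use. The function returns two numbers and deliberately does not combine them into a single readability score.

```python
import re

# Hypothetical stand-in for an "everyday words" list (an assumption for
# illustration; the word list used in the paper is not given in the abstract).
COMMON_WORDS = {"the", "a", "we", "use", "short", "words", "and", "is", "this", "plain"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    """Lowercase a sentence and keep runs of letters and apostrophes."""
    return re.findall(r"[A-Za-z']+", sentence.lower())


def plainness_measures(text, common_words=COMMON_WORDS):
    """Return mean sentence length and share of uncommon words, kept separate."""
    sentences = split_sentences(text)
    tokens = [tok for s in sentences for tok in tokenize(s)]
    mean_length = len(tokens) / len(sentences) if sentences else 0.0
    uncommon = sum(1 for tok in tokens if tok not in common_words)
    uncommon_share = uncommon / len(tokens) if tokens else 0.0
    return {
        "mean_sentence_length": mean_length,
        "uncommon_word_share": uncommon_share,
    }
```

Reporting the two values side by side makes it possible to see, for example, that one document has short sentences but harder vocabulary, which a single combined formula would obscure.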

Automating the Detection of Poetic Features: The Limerick as Model Organism
Almas Abdibayev | Yohei Igarashi | Allen Riddell | Daniel Rockmore
Proceedings of the 5th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

In this paper we take up the problem of “limerick detection” and describe a system to identify five-line poems as limericks or not. This turns out to be a surprisingly difficult challenge with many subtleties. More precisely, we produce an algorithm that focuses on the structural aspects of the limerick: rhyme scheme and rhythm (i.e., stress patterns). When tested on a culled data set of 98,454 publicly available limericks, our “limerick filter” accepts 67% as limericks. The primary failure of our filter is on the detection of “non-standard” rhymes, which we highlight as an outstanding challenge in computational poetics. Our accent detection algorithm proves to be very robust. Our main contributions are (1) a novel rhyme detection algorithm that works on English words including rare proper nouns and made-up words (and thus, words not in the widely used CMUDict database); and (2) a novel rhythm-identifying heuristic that is robust to language noise at moderate levels and comparable in accuracy to state-of-the-art scansion algorithms. As a third significant contribution, (3) we make publicly available a large corpus of limericks that includes tags of “limerick” or “not-limerick” as determined by our identification software, thereby providing a benchmark for the community. The poetic tasks that we have identified as challenges for machines suggest that the limerick is a useful “model organism” for the study of machine capabilities in poetry and, more broadly, literature and language. We include a list of open challenges as well. Generally, we anticipate that this work will provide useful material and benchmarks for future explorations in the field.
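
As a rough illustration of the rhyme-scheme half of such a filter, the sketch below checks whether five lines follow an AABBA pattern. It is not the authors' algorithm: it compares spellings from the last vowel group onward rather than pronunciations, so it ignores rhythm entirely and will miss exactly the kind of “non-standard” rhymes the abstract identifies as the hard case. All function names are hypothetical.

```python
import re


def last_word(line):
    """Return the final alphabetic word of a line, lowercased."""
    words = re.findall(r"[a-z']+", line.lower())
    return words[-1] if words else ""


def rhyme_key(word):
    """Crude orthographic key: the last vowel group plus any trailing consonants."""
    match = re.search(r"[aeiouy]+[^aeiouy]*$", word)
    return match.group(0) if match else word


def naive_rhymes(a, b):
    """Treat two words as rhyming when their spelling-based keys match."""
    return bool(a) and bool(b) and rhyme_key(a) == rhyme_key(b)


def looks_like_aabba(lines):
    """Check the AABBA end-rhyme pattern on exactly five lines."""
    if len(lines) != 5:
        return False
    w = [last_word(line) for line in lines]
    a_rhymes = naive_rhymes(w[0], w[1]) and naive_rhymes(w[1], w[4])
    b_rhymes = naive_rhymes(w[2], w[3])
    a_differs_from_b = not naive_rhymes(w[0], w[2])
    return a_rhymes and b_rhymes and a_differs_from_b


poem = [
    "There once was a cat on a mat",
    "Who wore a ridiculous hat",
    "It ran down the hall",
    "And bounced like a ball",
    "Then slept where it usually sat",
]
is_limerick_shaped = looks_like_aabba(poem)
```

A spelling-based key fails on pairs such as “through” and “blue”, which is one reason a pronunciation-aware approach that also handles words outside CMUDict is needed for real data.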