Diachronic Changes in Text Complexity in 20th Century English Language: An NLP Approach

Sanja Štajner, Ruslan Mitkov


Abstract
A syntactically complex text may pose a problem both for human comprehension and for various NLP tasks. Many studies in text simplification address this problem, aiming to transform a given text into a simplified form that is accessible to a wider audience. In this study, we investigated the natural tendency of texts in 20th century English: are they becoming syntactically more complex over the years, requiring a higher literacy level and greater effort from readers, or are they becoming simpler and easier to read? We examined several factors of text complexity (average sentence length, Automated Readability Index, sentence complexity and passive voice) in the 20th century for the two main English language varieties, British and American, using the 'Brown family' of corpora. In British English, we compared the complexity of texts published in 1931, 1961 and 1991, while in American English we compared the complexity of texts published in 1961 and 1992. Furthermore, we demonstrated how state-of-the-art NLP tools can be used for automatic extraction of some complex features from the raw text version of the corpora.
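Two of the factors named in the abstract, average sentence length and the Automated Readability Index (ARI), can be computed from raw text with simple counts. The sketch below is illustrative only: it uses the standard ARI formula, but the function name `text_complexity` and the naive regular-expression sentence and word splitting are assumptions made here, not the tools used in the paper. Sentence complexity and passive voice need a syntactic parser and are not covered.

```python
import re


def text_complexity(text: str) -> dict:
    """Return average sentence length (ASL) and ARI for raw text.

    Assumption: sentences end in '.', '!' or '?' followed by whitespace,
    and words are runs of letters, digits and apostrophes. A real study
    would use a proper sentence splitter and tokenizer.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    words = re.findall(r"[A-Za-z0-9']+", text)

    n_sentences = len(sentences)
    n_words = len(words)
    if n_sentences == 0 or n_words == 0:
        raise ValueError("text must contain at least one word")

    # ARI counts only letters and digits as characters.
    n_chars = sum(ch.isalnum() for w in words for ch in w)

    asl = n_words / n_sentences
    # Standard ARI formula: 4.71 * (chars/words) + 0.5 * (words/sentences) - 21.43
    ari = 4.71 * (n_chars / n_words) + 0.5 * asl - 21.43
    return {"asl": asl, "ari": ari}


if __name__ == "__main__":
    sample = "The cat sat. The dog ran fast."
    print(text_complexity(sample))
```

Applied to each sampled text in a corpus, scores of this kind could then be compared across publication years; how the paper aggregates and tests the differences is not stated in the abstract.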
Anthology ID:
L12-1172
Volume:
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Month:
May
Year:
2012
Address:
Istanbul, Turkey
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
1577–1584
URL:
http://www.lrec-conf.org/proceedings/lrec2012/pdf/355_Paper.pdf
Cite (ACL):
Sanja Štajner and Ruslan Mitkov. 2012. Diachronic Changes in Text Complexity in 20th Century English Language: An NLP Approach. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 1577–1584, Istanbul, Turkey. European Language Resources Association (ELRA).
Cite (Informal):
Diachronic Changes in Text Complexity in 20th Century English Language: An NLP Approach (Štajner & Mitkov, LREC 2012)