Parallel Text Alignment and Monolingual Parallel Corpus Creation from Philosophical Texts for Text Simplification

Stefan Paun


Abstract
Text simplification is a growing field with many potentially useful applications. Training text simplification algorithms generally requires a lot of annotated data; however, few corpora are suitable for this task. We propose a new unsupervised method for aligning text, based on Doc2Vec embeddings and a new alignment algorithm capable of aligning texts at different levels. Initial evaluation shows promising results for the new approach. We used the new approach to create a monolingual parallel corpus composed of the works of English early modern philosophers and their corresponding simplified versions.
Anthology ID:
2021.naacl-srw.6
Volume:
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop
Month:
June
Year:
2021
Address:
Online
Editors:
Esin Durmus, Vivek Gupta, Nelson Liu, Nanyun Peng, Yu Su
Venue:
NAACL
Publisher:
Association for Computational Linguistics
Pages:
40–46
URL:
https://aclanthology.org/2021.naacl-srw.6
DOI:
10.18653/v1/2021.naacl-srw.6
Cite (ACL):
Stefan Paun. 2021. Parallel Text Alignment and Monolingual Parallel Corpus Creation from Philosophical Texts for Text Simplification. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop, pages 40–46, Online. Association for Computational Linguistics.
Cite (Informal):
Parallel Text Alignment and Monolingual Parallel Corpus Creation from Philosophical Texts for Text Simplification (Paun, NAACL 2021)
PDF:
https://aclanthology.org/2021.naacl-srw.6.pdf
Video:
https://aclanthology.org/2021.naacl-srw.6.mp4
Code:
stefanpaun/massalign
Data:
Newsela