Overcoming Copyright Barriers in Corpus Distribution Through Non-Reversible Hashing

Arthur Amalvy; Vincent Labatut; Xavier Bost; Hen-Hsen Huang

Abstract

While annotated corpora are crucial in the field of natural language processing (NLP), those containing copyrighted material are difficult to exchange among researchers. Yet, such corpora are necessary to fully represent the diversity of data found in the wild in the context of NLP tasks. We tackle this issue by proposing a method to lawfully share the annotations of any sequential copyrighted corpus. The corpus creator shares the annotations in the clear, along with a non-reversible hashed version of the source material. The corpus user must own the source material and apply the same hash function to their own tokens in order to match them to the shared annotations. Crucially, our method is robust to reasonable divergences in the version of the copyrighted data owned by the user. As an illustration, we present alignment experiments on different editions of novels. Our results show that our method correctly aligns 98.7% to 99.79% of tokens depending on the novel, provided the user's version is sufficiently close to the corpus creator's version. We publicly release novelties-bookshare, a Python implementation of our method.
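The workflow described in the abstract can be illustrated with a minimal sketch. This is not the novelties-bookshare API: the function names (`hash_token`, `share`, `align`), the choice of SHA-256 as the hash, the use of `difflib.SequenceMatcher` for alignment, and the `"O"` default label are all assumptions made for illustration, and the paper's actual hashing and alignment procedures may differ.

```python
import hashlib
from difflib import SequenceMatcher


def hash_token(token: str) -> str:
    """Hash one token. SHA-256 is an assumed choice, not the paper's."""
    return hashlib.sha256(token.encode("utf-8")).hexdigest()


def share(tokens, annotations):
    """Creator side (hypothetical helper): hashed tokens plus clear annotations."""
    return [hash_token(t) for t in tokens], list(annotations)


def align(shared_hashes, shared_annotations, user_tokens, default="O"):
    """User side (hypothetical helper): hash own tokens, align both hash
    sequences, and copy annotations across the matching blocks."""
    user_hashes = [hash_token(t) for t in user_tokens]
    matcher = SequenceMatcher(a=shared_hashes, b=user_hashes, autojunk=False)
    user_annotations = [default] * len(user_tokens)
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            user_annotations[block.b + k] = shared_annotations[block.a + k]
    return user_annotations


# Creator's edition, with named-entity tags as example annotations.
creator_tokens = ["Alice", "met", "the", "White", "Rabbit", "."]
creator_tags = ["B-PER", "O", "O", "B-PER", "I-PER", "O"]
shared_hashes, shared_tags = share(creator_tokens, creator_tags)

# User's edition differs by one token ("white" instead of "White").
user_tokens = ["Alice", "met", "the", "white", "Rabbit", "."]
result = align(shared_hashes, shared_tags, user_tokens)
print(result)
```

In this sketch, the token that differs between editions has no matching hash and falls back to the default label, while the tokens on either side of it still receive their annotations.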

Anthology ID: 2026.acl-long.2149
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 46329–46345
URL: https://aclanthology.org/2026.acl-long.2149/
Cite (ACL): Arthur Amalvy, Vincent Labatut, Xavier Bost, and Hen-Hsen Huang. 2026. Overcoming Copyright Barriers in Corpus Distribution Through Non-Reversible Hashing. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 46329–46345, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Overcoming Copyright Barriers in Corpus Distribution Through Non-Reversible Hashing (Amalvy et al., ACL 2026)
PDF: https://aclanthology.org/2026.acl-long.2149.pdf
Checklist: 2026.acl-long.2149.checklist.pdf
