How Far Does the Sequence of Compositions Impact Multilingual Pre-Training?

Leonardo Ranaldi, Giulia Pucci, Fabio Massimo Zanzotto


Abstract
The most efficient strategy for pre-training language models is to concatenate contiguous sequences of text into fixed-length chunks and apply causal masking, estimating the probability of each token given its context. However, the role of this sequence-composition technique in the models' generalization properties has yet to be explored. In this paper, we show that operating via standard causal masking impacts model performance because it can expose the model to misleading information from previous text sequences during pre-training. To fill this gap, we propose intra-context causal masking, where the probability of each token is conditioned only on the previous tokens in the same chunk of text, avoiding misleading information from different contexts. We further demonstrate that organizing text chunks according to a policy based on text similarity effectively reduces the risk of misleading context during pre-training, enhancing language models' in-context learning and factual knowledge storage capabilities while maintaining efficiency.
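As a rough illustration of the masking idea described in the abstract (a minimal sketch, not the authors' implementation; the helper name, the PyTorch dependency, and the per-position chunk ids are assumptions), intra-context causal masking can be built by intersecting a standard causal mask with a block-diagonal "same chunk" mask:

    # Minimal sketch: intra-context causal masking.
    # Each token may attend only to earlier tokens in its own text chunk,
    # never to tokens from previously concatenated sequences.
    import torch

    def intra_context_causal_mask(doc_ids: torch.Tensor) -> torch.Tensor:
        """doc_ids: (seq_len,) integer chunk id per position.
        Returns a (seq_len, seq_len) boolean mask; True = attention allowed."""
        seq_len = doc_ids.size(0)
        causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))  # standard causal mask
        same_doc = doc_ids.unsqueeze(0) == doc_ids.unsqueeze(1)              # block-diagonal: same chunk only
        return causal & same_doc

    # Example: two concatenated chunks of lengths 3 and 2.
    doc_ids = torch.tensor([0, 0, 0, 1, 1])
    mask = intra_context_causal_mask(doc_ids)
    # Positions 3-4 (second chunk) cannot attend to positions 0-2 (first chunk).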
Anthology ID:
2024.clicit-1.86
Volume:
Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024)
Month:
December
Year:
2024
Address:
Pisa, Italy
Editors:
Felice Dell'Orletta, Alessandro Lenci, Simonetta Montemagni, Rachele Sprugnoli
Venue:
CLiC-it
Publisher:
CEUR Workshop Proceedings
Pages:
796–804
URL:
https://aclanthology.org/2024.clicit-1.86/
Cite (ACL):
Leonardo Ranaldi, Giulia Pucci, and Fabio Massimo Zanzotto. 2024. How Far Does the Sequence of Compositions Impact Multilingual Pre-Training?. In Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), pages 796–804, Pisa, Italy. CEUR Workshop Proceedings.
Cite (Informal):
How Far Does the Sequence of Compositions Impact Multilingual Pre-Training? (Ranaldi et al., CLiC-it 2024)
PDF:
https://aclanthology.org/2024.clicit-1.86.pdf