On the Impact of Cross-Domain Data on German Language Models
Amin Dada | Aokun Chen | Cheng Peng | Kaleb Smith | Ahmad Idrissi-Yaghir | Constantin Seibold | Jianning Li | Lars Heiliger | Christoph Friedrich | Daniel Truhn | Jan Egger | Jiang Bian | Jens Kleesiek | Yonghui Wu
Findings of the Association for Computational Linguistics: EMNLP 2023
Traditionally, large language models have been trained on either general web crawls or domain-specific data. However, recent successes of generative large language models have shed light on the benefits of cross-domain datasets. To examine the significance of prioritizing data diversity over quality, we present a German dataset comprising texts from five domains, along with another dataset aimed at containing high-quality data. Through training a series of models ranging between 122M and 750M parameters on both datasets, we conduct a comprehensive benchmark on multiple downstream tasks. Our findings demonstrate that the models trained on the cross-domain dataset outperform those trained on quality data alone, leading to improvements of up to 4.45% over the previous state-of-the-art.