DataScience-Polimi at SemEval-2022 Task 8: Stacking Language Models to Predict News Article Similarity

Marco Di Giovanni, Thomas Tasca, Marco Brambilla


Abstract
In this paper, we describe the approach we designed to solve SemEval-2022 Task 8: Multilingual News Article Similarity. We collect and use exclusively the textual features (title, description and body) of the articles. Our best model is a stack of 14 Transformer-based language models, each fine-tuned on a single field or on multiple fields, using data either in the original language or translated to English. It placed fourth on the original leaderboard, sixth on the complete official one and fourth on the English-subset official one. We identify data collection as our principal source of error, because a substantial fraction of fields are missing or wrong.
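To illustrate the stacking idea mentioned in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes that each base model has already produced one similarity score per article pair, and it fits a linear meta-learner on those scores with ordinary least squares. The function names (`fit_stacker`, `predict_stacker`), the choice of a linear meta-learner, the three synthetic base models and the score range are all assumptions made for illustration; the abstract does not specify them.

```python
import numpy as np


def fit_stacker(base_preds, targets):
    """Fit a linear meta-learner (hypothetical helper).

    base_preds: array of shape (n_pairs, n_models), one column per base model.
    targets:    array of shape (n_pairs,), the gold similarity scores.
    Returns one weight per base model plus a final intercept term.
    """
    # Append a column of ones so the last weight acts as an intercept.
    X = np.column_stack([base_preds, np.ones(len(base_preds))])
    weights, *_ = np.linalg.lstsq(X, targets, rcond=None)
    return weights


def predict_stacker(weights, base_preds):
    """Combine base-model scores into one stacked prediction per pair."""
    X = np.column_stack([base_preds, np.ones(len(base_preds))])
    return X @ weights


# Synthetic stand-in data: the score range and noise levels are assumptions.
rng = np.random.default_rng(0)
true_scores = rng.uniform(1.0, 4.0, size=200)
base_preds = np.column_stack(
    [true_scores + rng.normal(0.0, noise, size=200) for noise in (0.3, 0.5, 0.8)]
)

# Fit the meta-learner on the first 150 pairs, predict on the remaining 50.
weights = fit_stacker(base_preds[:150], true_scores[:150])
stacked = predict_stacker(weights, base_preds[150:])
```

In practice the meta-learner should be fitted on predictions the base models made for held-out pairs, so that it does not simply reward the base models that overfit their training data.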
Anthology ID:
2022.semeval-1.174
Volume:
Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)
Month:
July
Year:
2022
Address:
Seattle, United States
Editors:
Guy Emerson, Natalie Schluter, Gabriel Stanovsky, Ritesh Kumar, Alexis Palmer, Nathan Schneider, Siddharth Singh, Shyam Ratan
Venue:
SemEval
SIG:
SIGLEX
Publisher:
Association for Computational Linguistics
Pages:
1229–1234
URL:
https://aclanthology.org/2022.semeval-1.174
DOI:
10.18653/v1/2022.semeval-1.174
Cite (ACL):
Marco Di Giovanni, Thomas Tasca, and Marco Brambilla. 2022. DataScience-Polimi at SemEval-2022 Task 8: Stacking Language Models to Predict News Article Similarity. In Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022), pages 1229–1234, Seattle, United States. Association for Computational Linguistics.
Cite (Informal):
DataScience-Polimi at SemEval-2022 Task 8: Stacking Language Models to Predict News Article Similarity (Di Giovanni et al., SemEval 2022)
PDF:
https://aclanthology.org/2022.semeval-1.174.pdf
Video:
https://aclanthology.org/2022.semeval-1.174.mp4