Yusuf Can Semerci
2025
Analyzing the Attention Heads for Pronoun Disambiguation in Context-aware Machine Translation Models
Paweł Mąka | Yusuf Can Semerci | Jan Scholtes | Gerasimos Spanakis
Proceedings of the 31st International Conference on Computational Linguistics
In this paper, we investigate the role of attention heads in Context-aware Machine Translation models for pronoun disambiguation in the English-to-German and English-to-French language directions. We analyze their influence by both observing and modifying the attention scores corresponding to the plausible relations that could impact a pronoun prediction. Our findings reveal that while some heads do attend to the relations of interest, not all of them influence the models’ ability to disambiguate pronouns. We show that certain heads are underutilized by the models, suggesting that model performance could be improved if these heads attended one of the relations more strongly. Furthermore, we fine-tune the most promising heads and observe an increase in pronoun disambiguation accuracy of up to 5 percentage points, which demonstrates that the improvements in performance can be solidified into the models’ parameters.
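As a rough illustration of the kind of attention intervention described in the abstract (not the authors' actual implementation), the sketch below adds a positive offset to the raw attention score linking a pronoun query position to a hypothetical antecedent position in a single toy attention head, then compares the attention distributions before and after the change. The positions, the offset, and the single-head setup are illustrative assumptions.

```python
import torch


def attention_with_intervention(q, k, v, pronoun_pos, antecedent_pos, boost=2.0):
    """Toy single-head scaled dot-product attention in which the raw score
    from the pronoun position to a hypothetical antecedent position is
    increased before the softmax. `pronoun_pos`, `antecedent_pos`, and
    `boost` are illustrative assumptions, not values from the paper."""
    d = q.size(-1)
    scores = q @ k.transpose(0, 1) / d ** 0.5          # (seq_len, seq_len) raw scores
    modified = scores.clone()
    modified[pronoun_pos, antecedent_pos] += boost      # strengthen the relation of interest
    attn_orig = torch.softmax(scores, dim=-1)
    attn_mod = torch.softmax(modified, dim=-1)
    return attn_orig @ v, attn_mod @ v, attn_orig, attn_mod


if __name__ == "__main__":
    torch.manual_seed(0)
    seq_len, d = 6, 8
    q, k, v = (torch.randn(seq_len, d) for _ in range(3))
    _, _, a_orig, a_mod = attention_with_intervention(q, k, v, pronoun_pos=4, antecedent_pos=1)
    print("attention mass on antecedent before:", a_orig[4, 1].item())
    print("attention mass on antecedent after: ", a_mod[4, 1].item())
```

Comparing the pronoun prediction of the full model with and without such an intervention is one way to probe whether a given head actually influences disambiguation, as opposed to merely attending to the antecedent.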
2024
Fixed and Adaptive Simultaneous Machine Translation Strategies Using Adapters
Abderrahmane Issam | Yusuf Can Semerci | Jan Scholtes | Gerasimos Spanakis
Proceedings of the 21st International Conference on Spoken Language Translation (IWSLT 2024)
Simultaneous machine translation aims at solving the task of real-time translation by starting to translate before consuming the full input, which poses challenges in balancing the quality and latency of the translation. The wait-k policy offers a solution by starting to translate after consuming k words, where the choice of the number k directly affects the latency and quality. In applications where we seek to keep the choice over latency and quality at inference, the wait-k policy obliges us to train more than one model. In this paper, we address the challenge of building one model that can fulfil multiple latency levels, and we achieve this by introducing lightweight adapter modules into the decoder. The adapters are trained to be specialized for different wait-k values and, compared to other techniques, offer more flexibility, allowing us to reap the benefits of parameter sharing while minimizing interference. Additionally, we show that by combining with an adaptive strategy, we can further improve the results. Experiments on two language directions show that our method outperforms or competes with other strong baselines at most latency values.
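To make the read/write schedule of the wait-k policy concrete, the minimal sketch below (an illustration of the generic policy, not of the paper's adapter-based model) emits the i-th target token only once the first i + k source tokens have been read. The `translate_step` callable is a hypothetical stand-in for a real incremental decoder.

```python
from typing import Callable, List


def wait_k_decode(source: List[str],
                  translate_step: Callable[[List[str], List[str]], str],
                  k: int = 3,
                  eos: str = "</s>") -> List[str]:
    """Generic wait-k schedule: READ until k source tokens are available,
    then alternate WRITE (emit one target token) and READ (consume one more
    source token, until the source is exhausted). `translate_step` maps
    (source_prefix, target_prefix) -> next target token and is assumed,
    not taken from the paper."""
    read, target = 0, []
    while True:
        # READ: consume source tokens until len(target) + k are available (or source ends).
        while read < min(len(target) + k, len(source)):
            read += 1
        # WRITE: emit one target token conditioned on the current source prefix.
        token = translate_step(source[:read], target)
        if token == eos:
            break
        target.append(token)
    return target


if __name__ == "__main__":
    # Dummy "decoder" that simply copies source tokens, just to show the schedule.
    def dummy_step(src_prefix, tgt_prefix):
        return src_prefix[len(tgt_prefix)] if len(tgt_prefix) < len(src_prefix) else "</s>"

    print(wait_k_decode("this is a simple example sentence".split(), dummy_step, k=2))
```

Under this schedule, a smaller k lowers latency but gives the decoder less source context per emitted token; training one adapter per wait-k value, as the paper proposes, lets a single shared model serve several points on that latency-quality trade-off.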