Edwin Thomas


2024

Keyphrase Generation: Lessons from a Reproducibility Study
Edwin Thomas | Sowmya Vajjala
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Reproducibility studies are treated as a means to verify the validity of a scientific method, but what else can we learn from such experiments? In this paper, we addressed this question with Keyphrase Generation (KPG) as the use case. We studied three state-of-the-art KPG models in terms of reproducibility under either the same (same data/model/code) or varied (different training data/model, but same code) conditions, and explored different ways of comparing KPG models beyond the most commonly used evaluation measures. Based on these experiments, we drew some conclusions on the state of the art in KPG and provided guidelines for researchers working on the topic about reporting experimental results in a more comprehensive manner.

Improving Absent Keyphrase Generation with Diversity Heads
Edwin Thomas | Sowmya Vajjala
Findings of the Association for Computational Linguistics: NAACL 2024

Keyphrase Generation (KPG) is the task of automatically generating appropriate keyphrases for a given text, with a wide range of real-world applications such as document indexing and tagging, information retrieval, and text summarization. During evaluation, NLP research distinguishes between present and absent keyphrases based on whether a keyphrase appears directly as a sequence of words in the document. During training, however, present and absent keyphrases are treated together in a text-to-text generation framework. In this paper, we treat present keyphrase extraction as a sequence labeling problem and propose a new absent keyphrase generation model that uses a modified cross-attention layer with additional heads to capture diverse views of the same context encoding. Our experiments show improvements over the state of the art on four datasets for present keyphrase extraction and on five datasets for absent keyphrase generation among the six English datasets we explored, covering long and short documents.
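
The present/absent distinction described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function name `split_keyphrases`, the lowercasing, and the whitespace tokenization are all assumptions made for illustration, and a real evaluation setup may normalize text differently (for example with stemming), which is not shown here.

```python
from typing import List, Tuple


def split_keyphrases(document: str, keyphrases: List[str]) -> Tuple[List[str], List[str]]:
    """Split keyphrases into (present, absent) lists.

    A keyphrase counts as present when its tokens occur as a contiguous
    sequence of words in the document. Tokenization here is a plain
    lowercase whitespace split, which is an illustrative assumption.
    """
    doc_tokens = document.lower().split()
    present, absent = [], []
    for phrase in keyphrases:
        phrase_tokens = phrase.lower().split()
        n = len(phrase_tokens)
        # Slide a window of n tokens over the document and compare.
        found = n > 0 and any(
            doc_tokens[i:i + n] == phrase_tokens
            for i in range(len(doc_tokens) - n + 1)
        )
        (present if found else absent).append(phrase)
    return present, absent


# Hypothetical example document and gold keyphrases.
document = "Keyphrase generation models are evaluated on short and long documents"
gold = ["keyphrase generation", "long documents", "information retrieval"]
present, absent = split_keyphrases(document, gold)
print(present)
print(absent)
```

In this toy example, "information retrieval" never occurs in the document, so it falls into the absent list, while the other two phrases match contiguous word sequences and are counted as present.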