2024
SemRel2024: A Collection of Semantic Textual Relatedness Datasets for 13 Languages
Nedjma Ousidhoum | Shamsuddeen Muhammad | Mohamed Abdalla | Idris Abdulmumin | Ibrahim Ahmad | Sanchit Ahuja | Alham Aji | Vladimir Araujo | Abinew Ayele | Pavan Baswani | Meriem Beloucif | Chris Biemann | Sofia Bourhim | Christine Kock | Genet Dekebo | Oumaima Hourrane | Gopichand Kanumolu | Lokesh Madasu | Samuel Rutunda | Manish Shrivastava | Thamar Solorio | Nirmal Surange | Hailegnaw Tilaye | Krishnapriya Vishnubhotla | Genta Winata | Seid Yimam | Saif Mohammad
Findings of the Association for Computational Linguistics: ACL 2024
Exploring and quantifying semantic relatedness is central to representing language and holds significant implications across various NLP tasks. While earlier NLP research primarily focused on semantic similarity, often within the English language context, we instead investigate the broader phenomenon of semantic relatedness. In this paper, we present SemRel, a new semantic relatedness dataset collection annotated by native speakers across 13 languages: Afrikaans, Algerian Arabic, Amharic, English, Hausa, Hindi, Indonesian, Kinyarwanda, Marathi, Moroccan Arabic, Modern Standard Arabic, Spanish, and Telugu. These languages originate from five distinct language families and are predominantly spoken in Africa and Asia – regions characterised by a relatively limited availability of NLP resources. Each instance in the SemRel datasets is a sentence pair associated with a score that represents the degree of semantic textual relatedness between the two sentences. The scores are obtained using a comparative annotation framework. We describe the data collection and annotation processes, challenges when building the datasets, baseline experiments, and their impact and utility in NLP.
Collaboration or Corporate Capture? Quantifying NLP’s Reliance on Industry Artifacts and Contributions
Will Aitken | Mohamed Abdalla | Karen Rudie | Catherine Stinson
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Impressive performance of pre-trained models has garnered public attention and made news headlines in recent years. Almost always, these models are produced by or in collaboration with industry. Using them is critical for competing on natural language processing (NLP) benchmarks and, correspondingly, for staying relevant in NLP research. We surveyed 100 papers published at EMNLP 2022 to determine the degree to which researchers rely on industry models, other artifacts, and contributions to publish in prestigious NLP venues and found that the ratio of their citation is at least three times greater than what would be expected. Our work serves as a scaffold to enable future researchers to more accurately address whether 1) collaboration with industry is still collaboration in the absence of an alternative or 2) NLP inquiry has been captured by the motivations and research direction of private corporations.
SemEval Task 1: Semantic Textual Relatedness for African and Asian Languages
Nedjma Ousidhoum | Shamsuddeen Hassan Muhammad | Mohamed Abdalla | Idris Abdulmumin | Ibrahim Said Ahmad | Sanchit Ahuja | Alham Fikri Aji | Vladimir Araujo | Meriem Beloucif | Christine De Kock | Oumaima Hourrane | Manish Shrivastava | Thamar Solorio | Nirmal Surange | Krishnapriya Vishnubhotla | Seid Muhie Yimam | Saif M. Mohammad
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)
We present the first shared task on Semantic Textual Relatedness (STR). While earlier shared tasks primarily focused on semantic similarity, we instead investigate the broader phenomenon of semantic relatedness across 14 languages: Afrikaans, Algerian Arabic, Amharic, English, Hausa, Hindi, Indonesian, Kinyarwanda, Marathi, Moroccan Arabic, Modern Standard Arabic, Punjabi, Spanish, and Telugu. These languages originate from five distinct language families and are predominantly spoken in Africa and Asia – regions characterised by the relatively limited availability of NLP resources. Each instance in the datasets is a sentence pair associated with a score that represents the degree of semantic textual relatedness between the two sentences. Participating systems were asked to rank sentence pairs by their closeness in meaning (i.e., their degree of semantic relatedness) in the 14 languages in three main tracks: (a) supervised, (b) unsupervised, and (c) crosslingual. The task attracted 163 participants. We received 70 submissions in total (across all tasks) from 51 different teams, and 38 system description papers. We report on the best-performing systems as well as the most common and the most effective approaches for the three different tracks.
2023
The Elephant in the Room: Analyzing the Presence of Big Tech in Natural Language Processing Research
Mohamed Abdalla | Jan Philip Wahle | Terry Ruas | Aurélie Névéol | Fanny Ducel | Saif Mohammad | Karen Fort
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Recent advances in deep learning methods for natural language processing (NLP) have created new business opportunities and made NLP research critical for industry development. Because industry is one of the big players in the field of NLP, together with governments and universities, it is important to track its influence on research. In this study, we seek to quantify and characterize industry presence in the NLP community over time. Using a corpus with comprehensive metadata of 78,187 NLP publications and 701 resumes of NLP publication authors, we explore the industry presence in the field since the early 90s. We find that industry presence among NLP authors was steady before a steep increase over the past five years (180% growth from 2017 to 2022). A few companies account for most of the publications and provide funding to academic researchers through grants and internships. Our study shows that the presence and impact of the industry on natural language processing research are significant and fast-growing. This work calls for increased transparency of industry influence in the field.
We are Who We Cite: Bridges of Influence Between Natural Language Processing and Other Academic Fields
Jan Philip Wahle | Terry Ruas | Mohamed Abdalla | Bela Gipp | Saif Mohammad
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Natural Language Processing (NLP) is poised to substantially influence the world. However, significant progress comes hand-in-hand with substantial risks. Addressing them requires broad engagement with various fields of study. Yet, little empirical work examines the state of such engagement (past or current). In this paper, we quantify the degree of influence between 23 fields of study and NLP (on each other). We analyzed ~77k NLP papers, ~3.1m citations from NLP papers to other papers, and ~1.8m citations from other papers to NLP papers. We show that, unlike most fields, the cross-field engagement of NLP, measured by our proposed Citation Field Diversity Index (CFDI), has declined from 0.58 in 1980 to 0.31 in 2022 (an all-time low). In addition, we find that NLP has grown more insular, citing increasingly more NLP papers and having fewer papers that act as bridges between fields. NLP citations are dominated by computer science; less than 8% of NLP citations are to linguistics, and less than 3% are to math and psychology. These findings underscore NLP’s urgent need to reflect on its engagement with various fields.
What Makes Sentences Semantically Related? A Textual Relatedness Dataset and Empirical Study
Mohamed Abdalla | Krishnapriya Vishnubhotla | Saif Mohammad
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics
The degree of semantic relatedness of two units of language has long been considered fundamental to understanding meaning. Additionally, automatically determining relatedness has many applications such as question answering and summarization. However, prior NLP work has largely focused on semantic similarity, a subset of relatedness, because of a lack of relatedness datasets. In this paper, we introduce a dataset for Semantic Textual Relatedness, STR-2022, that has 5,500 English sentence pairs manually annotated using a comparative annotation framework, resulting in fine-grained scores. We show that human intuition regarding relatedness of sentence pairs is highly reliable, with a repeat annotation correlation of 0.84. We use the dataset to explore questions on what makes sentences semantically related. We also show the utility of STR-2022 for evaluating automatic methods of sentence representation and for various downstream NLP tasks. Our dataset, data statement, and annotation questionnaire can be found at: https://doi.org/10.5281/zenodo.7599667.
2020
Examining the rhetorical capacities of neural language models
Zining Zhu | Chuer Pan | Mohamed Abdalla | Frank Rudzicz
Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP
Recently, neural language models (LMs) have demonstrated impressive abilities in generating high-quality discourse. While many recent papers have analyzed the syntactic aspects encoded in LMs, there has been no analysis to date of the inter-sentential, rhetorical knowledge. In this paper, we propose a method that quantitatively evaluates the rhetorical capacities of neural LMs. We examine the capacity of neural LMs to understand the rhetoric of discourse by evaluating their ability to encode a set of linguistic features derived from Rhetorical Structure Theory (RST). Our experiments show that BERT-based LMs outperform other Transformer LMs, revealing the richer discourse knowledge in their intermediate layer representations. In addition, GPT-2 and XLNet apparently encode less rhetorical knowledge, and we suggest an explanation drawing from linguistic philosophy. Our method shows an avenue towards quantifying the rhetorical capacities of neural LMs.
2017
Cross-Lingual Sentiment Analysis Without (Good) Translation
Mohamed Abdalla | Graeme Hirst
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers)
Current approaches to cross-lingual sentiment analysis try to leverage the wealth of labeled English data using bilingual lexicons, bilingual vector space embeddings, or machine translation systems. Here we show that it is possible to use a single linear transformation, with as few as 2000 word pairs, to capture fine-grained sentiment relationships between words in a cross-lingual setting. We apply these cross-lingual sentiment models to a diverse set of tasks to demonstrate their functionality in a non-English context. By effectively leveraging English sentiment knowledge without the need for accurate translation, we can analyze and extract features from other languages with scarce data at a very low cost, thus making sentiment and related analyses for many languages inexpensive.