George Tsatsaronis


2022

Overview of the DAGPap22 Shared Task on Detecting Automatically Generated Scientific Papers
Yury Kashnitsky | Drahomira Herrmannova | Anita de Waard | George Tsatsaronis | Catriona Fennell | Cyril Labbe
Proceedings of the Third Workshop on Scholarly Document Processing

This paper provides an overview of the DAGPap22 shared task on the detection of automatically generated scientific papers at the Scholarly Document Processing workshop co-located with COLING. We frame the detection problem as a binary classification task: given an excerpt of text, label it as either human-written or machine-generated. We shared a dataset containing excerpts from human-written papers as well as artificially generated content and suspicious documents collected by Elsevier publishing and editorial teams. As a test set, the participants were provided with a 5x larger corpus of openly accessible human-written as well as generated papers from the same scientific domains. The shared task saw 180 submissions across 14 participating teams and resulted in two published technical reports. We discuss our findings from the shared task in this overview paper.

Find the Funding: Entity Linking with Incomplete Funding Knowledge Bases
Gizem Aydin | Seyed Amin Tabatabaei | George Tsatsaronis | Faegheh Hasibi
Proceedings of the 29th International Conference on Computational Linguistics

Automatic extraction of funding information from academic articles adds significant value to industry and research communities, including tracking research outcomes by funding organizations, profiling researchers and universities based on the received funding, and supporting open access policies. Two major challenges of identifying and linking funding entities are: (i) the sparse graph structure of the Knowledge Base (KB), which makes the commonly used graph-based entity linking approaches suboptimal for the funding domain, and (ii) missing entities in the KB, which (unlike recent zero-shot approaches) require marking entity mentions without KB entries as NIL. We propose an entity linking model that can perform NIL prediction and overcome data scarcity issues in a time- and data-efficient manner. Our model builds on a transformer-based mention detection and a bi-encoder model to perform entity linking. We show that our model outperforms strong existing baselines.

2020

CORA: A Deep Active Learning Covid-19 Relevancy Algorithm to Identify Core Scientific Articles
Zubair Afzal | Vikrant Yadav | Olga Fedorova | Vaishnavi Kandala | Janneke van de Loo | Saber A. Akhondi | Pascal Coupet | George Tsatsaronis
Proceedings of the 1st Workshop on NLP for COVID-19 (Part 2) at EMNLP 2020

Ever since the COVID-19 pandemic broke out, the academic and scientific research community, as well as industry and governments around the world, have joined forces in an unprecedented manner to fight the threat. Clinicians, biologists, chemists, bioinformaticians, nurses, data scientists, and all of the affiliated relevant disciplines have been mobilized to help discover efficient treatments for the infected population, as well as a vaccine solution to prevent further spread of the virus. In this combat against the virus responsible for the pandemic, the key to any advancement is the timely, accurate, peer-reviewed, and efficient communication of novel research findings. In this paper we present a novel framework to address the information need of efficiently filtering the scientific bibliography for relevant literature around COVID-19. The contributions of the paper are summarized as follows: we define and describe the information need that encompasses the major requirements for COVID-19 article relevancy, we present and release an expert-curated benchmark set for the task, and we analyze the performance of several state-of-the-art machine learning classifiers that may distinguish the relevant from the non-relevant COVID-19 literature.

2019

EigenSent: Spectral sentence embeddings using higher-order Dynamic Mode Decomposition
Subhradeep Kayal | George Tsatsaronis
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Distributed representations of words, or word embeddings, have motivated methods for calculating semantic representations of word sequences such as phrases, sentences and paragraphs. Most of the existing methods to do so either use algorithms to learn such representations, or improve on calculating weighted averages of the word vectors. In this work, we experiment with spectral methods of signal representation and summarization as mechanisms for constructing such word-sequence embeddings in an unsupervised fashion. In particular, we explore an algorithm rooted in fluid dynamics, known as higher-order Dynamic Mode Decomposition, which is designed to capture the eigenfrequencies, and hence the fundamental transition dynamics, of periodic and quasi-periodic systems. It is empirically observed that this approach, which we call EigenSent, can summarize transitions in a sequence of words and generate an embedding that represents the sequence itself well. To the best of the authors’ knowledge, this is the first application of a spectral decomposition and signal summarization technique to text in order to create sentence embeddings. We test the efficacy of this algorithm in creating sentence embeddings on three public datasets, where it performs appreciably well. Moreover, it is also shown that, due to the positive combination of their complementary properties, concatenating the embeddings generated by EigenSent with simple word vector averaging achieves state-of-the-art results.

2018

Novelty Goes Deep. A Deep Neural Solution To Document Level Novelty Detection
Tirthankar Ghosal | Vignesh Edithal | Asif Ekbal | Pushpak Bhattacharyya | George Tsatsaronis | Srinivasa Satya Sameer Kumar Chivukula
Proceedings of the 27th International Conference on Computational Linguistics

The rapid growth of documents across the web has necessitated finding means of discarding redundant documents and retaining novel ones. Capturing redundancy is challenging as it may involve investigating at a deep semantic level. Techniques for detecting such semantic redundancy at the document level are scarce. In this work we propose a deep Convolutional Neural Network (CNN)-based model to classify a document as novel or redundant with respect to a set of relevant documents already seen by the system. The system is simple and does not require any manual feature engineering. Our novel scheme encodes relevant and relative information from both source and target texts to generate an intermediate representation which we coin the Relative Document Vector (RDV). The proposed method outperforms the existing state-of-the-art on a document-level novelty detection dataset by a margin of ∼5% in terms of accuracy. We further demonstrate the effectiveness of our approach on a standard paraphrase detection dataset where paraphrased passages closely resemble semantically redundant documents.

2017

Tagging Funding Agencies and Grants in Scientific Articles using Sequential Learning Models
Subhradeep Kayal | Zubair Afzal | George Tsatsaronis | Sophia Katrenko | Pascal Coupet | Marius Doornenbal | Michelle Gregory
BioNLP 2017

In this paper we present a solution for tagging funding bodies and grants in scientific articles using a combination of trained sequential learning models, namely conditional random fields (CRF), hidden Markov models (HMM) and maximum entropy models (MaxEnt), on a benchmark set created in-house. We apply the trained models to address the BioASQ challenge 5c, which is a newly introduced task that aims to solve the problem of funding information extraction from scientific articles. Results on the dry-run data set of BioASQ task 5c show that the suggested approach can achieve a micro-recall of more than 85% in tagging both funding bodies and grants.

2010

SemanticRank: Ranking Keywords and Sentences Using Semantic Graphs
George Tsatsaronis | Iraklis Varlamis | Kjetil Nørvåg
Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)

2009

A Generalized Vector Space Model for Text Retrieval Based on Semantic Relatedness
George Tsatsaronis | Vicky Panagiotopoulou
Proceedings of the Student Research Workshop at EACL 2009