Benchmark for Research Theme Classification of Scholarly Documents
Óscar E. Mendoza | Wojciech Kusa | Alaa El-Ebshihy | Ronin Wu | David Pride | Petr Knoth | Drahomira Herrmannova | Florina Piroi | Gabriella Pasi | Allan Hanbury
Proceedings of the Third Workshop on Scholarly Document Processing
We present a new gold-standard dataset and a benchmark for the Research Theme Identification task, a sub-task of the Scholarly Knowledge Graph Generation shared task, at the 3rd Workshop on Scholarly Document Processing. The objective of the shared task was to label given research papers with research themes from a total of 36 themes. The benchmark was compiled using data drawn from the largest overall assessment of university research output ever undertaken globally (the Research Excellence Framework - 2014). We provide a performance comparison of a transformer-based ensemble, which obtains multiple predictions for a research paper, given its multiple textual fields (e.g. title, abstract, reference), with traditional machine learning models. The ensemble involves enriching the initial data with additional information from open-access digital libraries and Argumentative Zoning techniques (CITATION). It uses a weighted sum aggregation for the multiple predictions to obtain a final single prediction for the given research paper. Both data and the ensemble are publicly available on https://www.kaggle.com/competitions/sdp2022-scholarly-knowledge-graph-generation/data?select=task1_test_no_label.csv and https://github.com/ProjectDoSSIER/sdp2022, respectively.
ARTU / TU Wien and Artificial Researcher@ LongSumm 20
Alaa El-Ebshihy | Annisa Maulida Ningtyas | Linda Andersson | Florina Piroi | Andreas Rauber
Proceedings of the First Workshop on Scholarly Document Processing
In this paper, we present our approach to solve the LongSumm 2020 Shared Task, at the 1st Workshop on Scholarly Document Processing. The objective of the long summaries task is to generate long summaries that cover salient information in scientific articles. The task is to generate abstractive and extractive summaries of a given scientific article. In the proposed approach, we are inspired by the concept of Argumentative Zoning (AZ) that de- fines the main rhetorical structure in scientific articles. We define two aspects that should be covered in scientific paper summary, namely Claim/Method and Conclusion/Result aspects. We use Solr index to expand the sentences of the paper abstract. We formulate each abstract sentence in a given publication as query to retrieve similar sentences from the text body of the document itself. We utilize a sentence selection algorithm described in previous literature to select sentences for the final summary that covers the two aforementioned aspects.
- Alaa El-Ebshihy 2
- Óscar E. Mendoza 1
- Wojciech Kusa 1
- Ronin Wu 1
- David Pride 1
- show all...