Mahmoud El-Haj

2022

pdf bib abs
CoFiF Plus: A French Financial Narrative Summarisation Corpus
Nadhem Zmandar | Tobias Daudert | Sina Ahmadi | Mahmoud El-Haj | Paul Rayson
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Natural Language Processing is increasingly being applied in the finance and business industry to analyse the text of many different types of financial documents. Given the increasing growth of firms around the world, the volume of financial disclosures and financial texts in different languages and forms is increasing sharply and therefore the study of language technology methods that automatically summarise content has grown rapidly into a major research area. Corpora for financial narrative summarisation exists in English, but there is a significant lack of financial text resources in the French language. To remedy this, we present CoFiF Plus, the first French financial narrative summarisation dataset providing a comprehensive set of financial text written in French. The dataset has been extracted from French financial reports published in PDF file format. It is composed of 1,703 reports from the most capitalised companies in France (Euronext Paris) covering a time frame from 1995 to 2021. This paper describes the collection, annotation and validation of the financial reports and their summaries. It also describes the dataset and gives the results of some baseline summarisers. Our datasets will be openly available upon the acceptance of the paper.

pdf bib abs
Introducing the Welsh Text Summarisation Dataset and Baseline Systems
Ignatius Ezeani | Mahmoud El-Haj | Jonathan Morris | Dawn Knight
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Welsh is an official language in Wales and is spoken by an estimated 884,300 people (29.2% of the population of Wales). Despite this status and estimated increase in speaker numbers since the last (2011) census, Welsh remains a minority language undergoing revitalisation and promotion by Welsh Government and relevant stakeholders. As part of the effort to increase the availability of Welsh digital technology, this paper introduces the first Welsh summarisation dataset, which we provide freely for research purposes to help advance the work on Welsh summarisation. The dataset was created by Welsh speakers through manually summarising Welsh Wikipedia articles. In addition, the paper discusses the implementation and evaluation of different summarisation systems for Welsh. The summarisation systems and results will serve as benchmarks for the development of summarisers in other minority language contexts.

pdf bib abs
IgboBERT Models: Building and Training Transformer Models for the Igbo Language
Chiamaka Chukwuneke | Ignatius Ezeani | Paul Rayson | Mahmoud El-Haj
Proceedings of the Thirteenth Language Resources and Evaluation Conference

This work presents a standard Igbo named entity recognition (IgboNER) dataset as well as the results from training and fine-tuning state-of-the-art transformer IgboNER models. We discuss the process of our dataset creation - data collection and annotation and quality checking. We also present experimental processes involved in building an IgboBERT language model from scratch as well as fine-tuning it along with other non-Igbo pre-trained models for the downstream IgboNER task. Our results show that, although the IgboNER task benefited hugely from fine-tuning large transformer model, fine-tuning a transformer model built from scratch with comparatively little Igbo text data seems to yield quite decent results for the IgboNER task. This work will contribute immensely to IgboNLP in particular as well as the wider African and low-resource NLP efforts Keywords: Igbo, named entity recognition, BERT models, under-resourced, dataset

pdf bib abs
AraSAS: The Open Source Arabic Semantic Tagger
Mahmoud El-Haj | Elvis de Souza | Nouran Khallaf | Paul Rayson | Nizar Habash
Proceedinsg of the 5th Workshop on Open-Source Arabic Corpora and Processing Tools with Shared Tasks on Qur'an QA and Fine-Grained Hate Speech Detection

This paper presents (AraSAS) the first open-source Arabic semantic analysis tagging system. AraSAS is a software framework that provides full semantic tagging of text written in Arabic. AraSAS is based on the UCREL Semantic Analysis System (USAS) which was first developed to semantically tag English text. Similarly to USAS, AraSAS uses a hierarchical semantic tag set that contains 21 major discourse fields and 232 fine-grained semantic field tags. The paper describes the creation, validation and evaluation of AraSAS. In addition, we demonstrate a first case study to illustrate the affordances of applying USAS and AraSAS semantic taggers on the Zayed University Arabic-English Bilingual Undergraduate Corpus (ZAEBUC) (Palfreyman and Habash, 2022), where we show and compare the coverage of the two semantic taggers through running them on Arabic and English essays on different topics. The analysis expands to compare the taggers when run on texts in Arabic and English written by the same writer and texts written by male and by female students. Variables for comparison include frequency of use of particular semantic sub-domains, as well as the diversity of semantic elements within a text.

pdf bib abs
Improved Evaluation of Automatic Source Code Summarisation
Jesse Phillips | David Bowes | Mahmoud El-Haj | Tracy Hall
Proceedings of the 2nd Workshop on Natural Language Generation, Evaluation, and Metrics (GEM)

Source code summaries are a vital tool for the understanding and maintenance of source code as they can be used to explain code in simple terms. However, source code with missing, incorrect, or outdated summaries is a common occurrence in production code. Automatic source code summarisation seeks to solve these issues by generating up-to-date summaries of source code methods. Recent work in automatically generating source code summaries uses neural networks for generating summaries; commonly Sequence-to-Sequence or Transformer models, pretrained on method-summary pairs. The most common method of evaluating the quality of these summaries is comparing the machine-generated summaries against human-written summaries. Summaries can be evaluated using n-gram-based translation metrics such as BLEU, METEOR, or ROUGE-L. However, these metrics alone can be unreliable and new Natural Language Generation metrics based on large pretrained language models provide an alternative. In this paper, we propose a method of improving the evaluation of a model by improving the preprocessing of the data used to train it, as well as proposing evaluating the model with a metric based off a language model, pretrained on a Natural Language (English) alongside traditional metrics. Our evaluation suggests our model has been improved by cleaning and preprocessing the data used in model training. The addition of a pretrained language model metric alongside traditional metrics shows that both produce results which can be used to evaluate neural source code summarisation.

pdf bib abs
Creation of an Evaluation Corpus and Baseline Evaluation Scores for Welsh Text Summarisation
Mahmoud El-Haj | Ignatius Ezeani | Jonathan Morris | Dawn Knight
Proceedings of the 4th Celtic Language Technology Workshop within LREC2022

As part of the effort to increase the availability of Welsh digital technology, this paper introduces the first human vs metrics Welsh summarisation evaluation results and dataset, which we provide freely for research purposes to help advance the work on Welsh summarisation. The system summaries were created using an extractive graph-based Welsh summariser. The system summaries were evaluated by both human and a range of ROUGE metric variants (e.g. ROUGE 1, 2, L and SU4). The summaries and evaluation results will serve as benchmarks for the development of summarisers and evaluation metrics in other minority language contexts.

pdf bib
Proceedings of the 4th Financial Narrative Processing Workshop @LREC2022
Mahmoud El-Haj | Paul Rayson | Nadhem Zmandar
Proceedings of the 4th Financial Narrative Processing Workshop @LREC2022

This paper presents the results and findings of the Financial Narrative Summarisation Shared Task on summarising UK, Greek and Spanish annual reports. The shared task was organised as part of the Financial Narrative Processing 2022 Workshop (FNP 2022 Workshop). The Financial Narrative summarisation Shared Task (FNS-2022) has been running since 2020 as part of the Financial Narrative Processing (FNP) workshop series (El-Haj et al., 2022; El-Haj et al., 2021; El-Haj et al., 2020b; El-Haj et al., 2019c; El-Haj et al., 2018). The shared task included one main task which is the use of either abstractive or extractive automatic summarisers to summarise long documents in terms of UK, Greek and Spanish financial annual reports. This shared task is the third to target financial documents. The data for the shared task was created and collected from publicly available annual reports published by firms listed on the Stock Exchanges of UK, Greece and Spain. A total number of 14 systems from 7 different teams participated in the shared task.

pdf bib abs
Financial Narrative Summarisation Using a Hybrid TF-IDF and Clustering Summariser: AO-Lancs System at FNS 2022
Mahmoud El-Haj | Andrew Ogden
Proceedings of the 4th Financial Narrative Processing Workshop @LREC2022

This paper describes the HTAC system submitted to the Financial Narrative Summarization Shared Task (FNS-2022). A methodology implementing Financial narrative Processing (FNP) to summarise financial annual reports, named Hybrid TF-IDF and Clustering (HTAC). This involves a hybrid approach combining TF-IDF sentence ranking as an NLP tool with a state-of-the-art Clustering Machine learning model to produce short 1000-word summaries of long financial annual reports. These Annual Reports are a legal responsibility of public companies and are in excess of 50,000 words. The model extracts the crucial information from these documents, discarding the extraneous content, leaving only the crucial information in a shorter, non-redundant summary. Producing summaries that are more effective than summaries produced by two pre-existing generic summarisers.

This paper describes the FinTOC-2022 Shared Task on the structure extraction from financial documents, its participants results and their findings. This shared task was organized as part of The 4th Financial Narrative Processing Workshop (FNP 2022), held jointly at The 13th Edition of the Language Resources and Evaluation Conference (LREC 2022), Marseille, France (El-Haj et al., 2022). This shared task aimed to stimulate research in systems for extracting table-of-contents (TOC) from investment documents (such as financial prospectuses) by detecting the document titles and organizing them hierarchically into a TOC. For the forth edition of this shared task, three subtasks were presented to the participants: one with English documents, one with French documents and the other one with Spanish documents. This year, we proposed a different and revised dataset for English and French compared to the previous editions of FinTOC and a new dataset for Spanish documents was added. The task attracted 6 submissions for each language from 4 teams, and the most successful methods make use of textual, structural and visual features extracted from the documents and propose classification models for detecting titles and TOCs for all of the subtasks.

pdf bib abs
The Financial Causality Extraction Shared Task (FinCausal 2022)
Dominique Mariko | Hanna Abi-Akl | Kim Trottier | Mahmoud El-Haj
Proceedings of the 4th Financial Narrative Processing Workshop @LREC2022

We present the FinCausal 2020 Shared Task on Causality Detection in Financial Documents and the associated FinCausal dataset, and discuss the participating systems and results. The task focuses on detecting if an object, an event or a chain of events is considered a cause for a prior event. This shared task focuses on determining causality associated with a quantified fact. An event is defined as the arising or emergence of a new object or context in regard to a previous situation. Therefore, the task will emphasise the detection of causality associated with transformation of financial objects embedded in quantified facts. A total number of 7 teams submitted system runs to the FinCausal task and contributed with a system description paper. FinCausal shared task is associated with the 4th Financial Narrative Processing Workshop (FNP 2022) (El-Haj et al., 2022) which is held at the The 13th Language Resources and Evaluation Conference (LREC 2022) in Marseille, France, on June 24, 2022.

2021

pdf bib
Proceedings of the 3rd Financial Narrative Processing Workshop
Mahmoud El-Haj | Paul Rayson | Nadhem Zmandar
Proceedings of the 3rd Financial Narrative Processing Workshop

pdf bib
Joint abstractive and extractive method for long financial document summarization
Nadhem Zmandar | Abhishek Singh | Mahmoud El-Haj | Paul Rayson
Proceedings of the 3rd Financial Narrative Processing Workshop

2020

pdf bib abs
The Financial Narrative Summarisation Shared Task (FNS 2020)
Mahmoud El-Haj | Ahmed AbuRa’ed | Marina Litvak | Nikiforos Pittaras | George Giannakopoulos
Proceedings of the 1st Joint Workshop on Financial Narrative Processing and MultiLing Financial Summarisation

This paper presents the results and findings of the Financial Narrative Summarisation shared task (FNS 2020) on summarising UK annual reports. The shared task was organised as part of the 1st Financial Narrative Processing and Financial Narrative Summarisation Workshop (FNP-FNS 2020). The shared task included one main task which is the use of either abstractive or extractive summarisation methodologies and techniques to automatically summarise UK financial annual reports. FNS summarisation shared task is the first to target financial annual reports. The data for the shared task was created and collected from publicly available UK annual reports published by firms listed on the London Stock Exchange (LSE). A total number of 24 systems from 9 different teams participated in the shared task. In addition we had 2 baseline summarisers and additional 2 topline summarisers to help evaluate and compare against the results of the participants.

pdf bib abs
The Financial Document Structure Extraction Shared task (FinToc 2020)
Najah-Imane Bentabet | Rémi Juge | Ismail El Maarouf | Virginie Mouilleron | Dialekti Valsamou-Stanislawski | Mahmoud El-Haj
Proceedings of the 1st Joint Workshop on Financial Narrative Processing and MultiLing Financial Summarisation

This paper presents the FinTOC-2020 Shared Task on structure extraction from financial documents, its participants results and their findings. This shared task was organized as part of The 1st Joint Workshop on Financial Narrative Processing and MultiLing Financial Summarisation (FNP-FNS 2020), held at The 28th International Conference on Computational Linguistics (COLING’2020). This shared task aimed to stimulate research in systems for extracting table-of-contents (TOC) from investment documents (such as financial prospectuses) by detecting the document titles and organizing them hierarchically into a TOC. For the second edition of this shared task, two subtasks were presented to the participants: one with English documents and the other one with French documents.

pdf bib abs
The Financial Document Causality Detection Shared Task (FinCausal 2020)
Dominique Mariko | Hanna Abi-Akl | Estelle Labidurie | Stephane Durfort | Hugues De Mazancourt | Mahmoud El-Haj
Proceedings of the 1st Joint Workshop on Financial Narrative Processing and MultiLing Financial Summarisation

We present the FinCausal 2020 Shared Task on Causality Detection in Financial Documents and the associated FinCausal dataset, and discuss the participating systems and results. Two sub-tasks are proposed: a binary classification task (Task 1) and a relation extraction task (Task 2). A total of 16 teams submitted runs across the two Tasks and 13 of them contributed with a system description paper. This workshop is associated to the Joint Workshop on Financial Narrative Processing and MultiLing Financial Summarisation (FNP-FNS 2020), held at The 28th International Conference on Computational Linguistics (COLING’2020), Barcelona, Spain on September 12, 2020.

pdf bib abs
Habibi - a multi Dialect multi National Arabic Song Lyrics Corpus
Mahmoud El-Haj
Proceedings of the Twelfth Language Resources and Evaluation Conference

This paper introduces Habibi the first Arabic Song Lyrics corpus. The corpus comprises more than 30,000 Arabic song lyrics in 6 Arabic dialects for singers from 18 different Arabic countries. The lyrics are segmented into more than 500,000 sentences (song verses) with more than 3.5 million words. I provide the corpus in both comma separated value (csv) and annotated plain text (txt) file formats. In addition, I converted the csv version into JavaScript Object Notation (json) and eXtensible Markup Language (xml) file formats. To experiment with the corpus I run extensive binary and multi-class experiments for dialect and country-of-origin identification. The identification tasks include the use of several classical machine learning and deep learning models utilising different word embeddings. For the binary dialect identification task the best performing classifier achieved a testing accuracy of 93%. This was achieved using a word-based Convolutional Neural Network (CNN) utilising a Continuous Bag of Words (CBOW) word embeddings model. The results overall show all classical and deep learning models to outperform our baseline, which demonstrates the suitability of the corpus for both dialect and country-of-origin identification tasks. I am making the corpus and the trained CBOW word embeddings freely available for research purposes.

We describe a novel super-infrastructure for biomedical text mining which incorporates an end-to-end pipeline for the collection, annotation, storage, retrieval and analysis of biomedical and life sciences literature, combining NLP and corpus linguistics methods. The infrastructure permits extreme-scale research on the open access PubMed Central archive. It combines an updatable Gene Ontology Semantic Tagger (GOST) for entity identification and semantic markup in the literature, with a NLP pipeline scheduler (Buster) to collect and process the corpus, and a bespoke columnar corpus database (LexiDB) for indexing. The corpus database is distributed to permit fast indexing, and provides a simple web front-end with corpus linguistics methods for sub-corpus comparison and retrieval. GOST is also connected as a service in the Language Application (LAPPS) Grid, in which context it is interoperable with other NLP tools and data in the Grid and can be combined with them in more complex workflows. In a literature based discovery setting, we have created an annotated corpus of 9,776 papers with 5,481,543 words.

2019

pdf bib
Proceedings of the 3rd Workshop on Arabic Corpus Linguistics
Mahmoud El-Haj | Paul Rayson | Eric Atwell | Lama Alsudias
Proceedings of the 3rd Workshop on Arabic Corpus Linguistics

pdf bib
Proceedings of the Second Financial Narrative Processing Workshop (FNP 2019)
Mahmoud El-Haj | Paul Rayson | Steven Young | Houda Bouamor | Sira Ferradans
Proceedings of the Second Financial Narrative Processing Workshop (FNP 2019)

pdf bib abs
MultiLing 2019: Financial Narrative Summarisation
Mahmoud El-Haj
Proceedings of the Workshop MultiLing 2019: Summarization Across Languages, Genres and Sources

The Financial Narrative Summarisation task at MultiLing 2019 aims to demonstrate the value and challenges of applying automatic text summarisation to financial text written in English, usually referred to as financial narrative disclosures. The task dataset has been extracted from UK annual reports published in PDF file format. The participants were asked to provide structured summaries, based on real-world, publicly available financial annual reports of UK firms by extracting information from different key sections. Participants were asked to generate summaries that reflects the analysis and assessment of the financial trend of the business over the past year, as provided by annual reports. The evaluation of the summaries was performed using AutoSummENG and Rouge automatic metrics. This paper focuses mainly on the data creation process.

2018

pdf bib
Arabic Dialect Identification in the Context of Bivalency and Code-Switching
Mahmoud El-Haj | Paul Rayson | Mariam Aboelezz
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
Profiling Medical Journal Articles Using a Gene Ontology Semantic Tagger
Mahmoud El-Haj | Paul Rayson | Scott Piao | Jo Knight
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

pdf bib abs
Creating and Validating Multilingual Semantic Representations for Six Languages: Expert versus Non-Expert Crowds
Mahmoud El-Haj | Paul Rayson | Scott Piao | Stephen Wattam
Proceedings of the 1st Workshop on Sense, Concept and Entity Representations and their Applications

Creating high-quality wide-coverage multilingual semantic lexicons to support knowledge-based approaches is a challenging time-consuming manual task. This has traditionally been performed by linguistic experts: a slow and expensive process. We present an experiment in which we adapt and evaluate crowdsourcing methods employing native speakers to generate a list of coarse-grained senses under a common multilingual semantic taxonomy for sets of words in six languages. 451 non-experts (including 427 Mechanical Turk workers) and 15 expert participants semantically annotated 250 words manually for Arabic, Chinese, English, Italian, Portuguese and Urdu lexicons. In order to avoid erroneous (spam) crowdsourced results, we used a novel task-specific two-phase filtering process where users were asked to identify synonyms in the target language, and remove erroneous senses.

2016

pdf bib abs
OSMAN ― A Novel Arabic Readability Metric
Mahmoud El-Haj | Paul Rayson
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We present OSMAN (Open Source Metric for Measuring Arabic Narratives) - a novel open source Arabic readability metric and tool. It allows researchers to calculate readability for Arabic text with and without diacritics. OSMAN is a modified version of the conventional readability formulas such as Flesch and Fog. In our work we introduce a novel approach towards counting short, long and stress syllables in Arabic which is essential for judging readability of Arabic narratives. We also introduce an additional factor called “Faseeh” which considers aspects of script usually dropped in informal Arabic writing. To evaluate our methods we used Spearman’s correlation metric to compare text readability for 73,000 parallel sentences from English and Arabic UN documents. The Arabic sentences were written with the absence of diacritics and in order to count the number of syllables we added the diacritics in using an open source tool called Mishkal. The results show that OSMAN readability formula correlates well with the English ones making it a useful tool for researchers and educators working with Arabic text.

Attribution bias refers to the tendency of people to attribute successes to their own abilities but failures to external factors. In a business context an internal factor might be the restructuring of the firm and an external factor might be an unfavourable change in exchange or interest rates. In accounting research, the presence of an attribution bias has been demonstrated for the narrative sections of the annual financial reports. Previous studies have applied manual content analysis to this problem but in this paper we present novel work to automate the analysis of attribution bias through using machine learning algorithms. Previous studies have only applied manual content analysis on a small scale to reveal such a bias in the narrative section of annual financial reports. In our work a group of experts in accounting and finance labelled and annotated a list of 32,449 sentences from a random sample of UK Preliminary Earning Announcements (PEAs) to allow us to examine whether sentences in PEAs contain internal or external attribution and which kinds of attributions are linked to positive or negative performance. We wished to examine whether human annotators could agree on coding this difficult task and whether Machine Learning (ML) could be applied reliably to replicate the coding process on a much larger scale. Our best machine learning algorithm correctly classified performance sentences with 70% accuracy and detected tone and attribution in financial PEAs with accuracy of 79%.

The last two decades have seen the development of various semantic lexical resources such as WordNet (Miller, 1995) and the USAS semantic lexicon (Rayson et al., 2004), which have played an important role in the areas of natural language processing and corpus-based studies. Recently, increasing efforts have been devoted to extending the semantic frameworks of existing lexical knowledge resources to cover more languages, such as EuroWordNet and Global WordNet. In this paper, we report on the construction of large-scale multilingual semantic lexicons for twelve languages, which employ the unified Lancaster semantic taxonomy and provide a multilingual lexical knowledge base for the automatic UCREL semantic annotation system (USAS). Our work contributes towards the goal of constructing larger-scale and higher-quality multilingual semantic lexical resources and developing corpus annotation tools based on them. Lexical coverage is an important factor concerning the quality of the lexicons and the performance of the corpus annotation tools, and in this experiment we focus on evaluating the lexical coverage achieved by the multilingual lexicons and semantic annotation tools based on them. Our evaluation shows that some semantic lexicons such as those for Finnish and Italian have achieved lexical coverage of over 90% while others need further expansion.

2014

pdf bib abs
Detecting Document Structure in a Very Large Corpus of UK Financial Reports
Mahmoud El-Haj | Paul Rayson | Steve Young | Martin Walker
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper we present the evaluation of our automatic methods for detecting and extracting document structure in annual financial reports. The work presented is part of the Corporate Financial Information Environment (CFIE) project in which we are using Natural Language Processing (NLP) techniques to study the causes and consequences of corporate disclosure and financial reporting outcomes. We aim to uncover the determinants of financial reporting quality and the factors that influence the quality of information disclosed to investors beyond the financial statements. The CFIE consists of the supply of information by firms to investors, and the mediating influences of information intermediaries on the timing, relevance and reliability of information available to investors. It is important to compare and contrast specific elements or sections of each annual financial report across our entire corpus rather than working at the full document level. We show that the values of some metrics e.g. readability will vary across sections, thus improving on previous research research based on full texts.

2013

pdf bib
Multi-document multilingual summarization corpus preparation, Part 1: Arabic, English, Greek, Chinese, Romanian
Lei Li | Corina Forascu | Mahmoud El-Haj | George Giannakopoulos
Proceedings of the MultiLing 2013 Workshop on Multilingual Multi-document Summarization

pdf bib
Using a Keyness Metric for Single and Multi Document Summarisation
Mahmoud El-Haj | Paul Rayson
Proceedings of the MultiLing 2013 Workshop on Multilingual Multi-document Summarization

2012

pdf bib abs
Assessing Crowdsourcing Quality through Objective Tasks
Ahmet Aker | Mahmoud El-Haj | M-Dyaa Albakour | Udo Kruschwitz
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

The emergence of crowdsourcing as a commonly used approach to collect vast quantities of human assessments on a variety of tasks represents nothing less than a paradigm shift. This is particularly true in academic research where it has suddenly become possible to collect (high-quality) annotations rapidly without the need of an expert. In this paper we investigate factors which can influence the quality of the results obtained through Amazon's Mechanical Turk crowdsourcing platform. We investigated the impact of different presentation methods (free text versus radio buttons), workers' base (USA versus India as the main bases of MTurk workers) and payment scale (about $4, $8 and $10 per hour) on the quality of the results. For each run we assessed the results provided by 25 workers on a set of 10 tasks. We run two different experiments using objective tasks: maths and general text questions. In both tasks the answers are unique, which eliminates the uncertainty usually present in subjective tasks, where it is not clear whether the unexpected answer is caused by a lack of worker's motivation, the worker's interpretation of the task or genuine ambiguity. In this work we present our results comparing the influence of the different factors used. One of the interesting findings is that our results do not confirm previous studies which concluded that an increase in payment attracts more noise. We also find that the country of origin only has an impact in some of the categories and only in general text questions but there is no significant difference at the top pay.