Serge Sharoff


2025

pdf bib
Controlling Out-of-Domain Gaps in LLMs for Genre Classification and Generated Text Detection
Dmitri Roussinov | Serge Sharoff | Nadezhda Puchnina
Proceedings of the 31st International Conference on Computational Linguistics

This study demonstrates that the modern generation of Large Language Models (LLMs, such as GPT-4) suffers from the same out-of-domain (OOD) performance gap observed in prior research on pre-trained Language Models (PLMs, such as BERT). We demonstrate this across two non-topical classification tasks: (1) genre classification and (2) generated text detection. Our results show that when demonstration examples for In-Context Learning (ICL) come from one domain (e.g., travel) and the system is tested on another domain (e.g., history), classification performance declines significantly. To address this, we introduce a method that controls which predictive indicators are used and which are excluded during classification. For the two tasks studied here, this ensures that topical features are omitted, while the model is guided to focus on stylistic rather than content-based attributes. This approach reduces the OOD gap by up to 20 percentage points in a few-shot setup. Straightforward Chain-of-Thought (CoT) methods, used as the baseline, prove insufficient, while our approach consistently enhances domain transfer performance.

pdf bib
BERT-based Classical Arabic Poetry Authorship Attribution
Lama Alqurashi | Serge Sharoff | Janet Watson | Jacob Blakesley
Proceedings of the 31st International Conference on Computational Linguistics

This study introduces a novel computational approach to authorship attribution (AA) in Arabic poetry, using the entire Classical Arabic Poetry corpus for the first time and offering a direct analysis of real cases of misattribution. AA in Arabic poetry has been a significant issue since the 9th century, particularly due to the loss of pre-Islamic poetry and the misattribution of post-Islamic works to earlier poets. While previous research has predominantly employed qualitative methods, this study uses computational techniques to address these challenges. The corpus was scraped from online sources and enriched with manually curated Date of Death (DoD) information to overcome the problematic traditional sectioning. Additionally, we applied Embedded Topic Modeling (ETM) to label each poem with its topic contributions, further enhancing the dataset’s value. An ensemble model based on CAMeLBERT was developed and tested across three dimensions: topic, number of poets, and number of training examples. After parameter optimization, the model achieved F1 scores ranging from 0.97 to 1.0. The model was also applied to four pre-Islamic misattribution cases, producing results consistent with historical and literary studies.

2024

pdf bib
Proceedings of the 17th Workshop on Building and Using Comparable Corpora (BUCC) @ LREC-COLING 2024
Pierre Zweigenbaum | Reinhard Rapp | Serge Sharoff
Proceedings of the 17th Workshop on Building and Using Comparable Corpora (BUCC) @ LREC-COLING 2024

pdf bib
Quantifying the Contribution of MWEs and Polysemy in Translation Errors for English–Igbo MT
Adaeze Ohuoba | Serge Sharoff | Callum Walker
Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 1)

In spite of recent successes in improving Machine Translation (MT) quality overall, MT engines require a large amount of resources, which leads to markedly lower quality for lesser-resourced languages. This study explores the case of translation from English into Igbo, a very low-resource language spoken by about 45 million speakers. With the aim of improving MT quality in this scenario, we investigate methods for guided detection of critical/harmful MT errors, more specifically those caused by non-compositional multi-word expressions and polysemy. We have designed diagnostic tests for these cases and applied them to collections of medical texts from CDC, Cochrane, NCDC, NHS and WHO.

pdf bib
Enhancing Image-to-Text Generation in Radiology Reports through Cross-modal Multi-Task Learning
Nurbanu Aksoy | Nishant Ravikumar | Serge Sharoff
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Image-to-text generation involves automatically generating descriptive text from images and has applications in medical report generation. However, traditional approaches often exhibit a semantic gap between visual and textual information. In this paper, we propose a multi-task learning framework to leverage both visual and non-imaging data for generating radiology reports. Along with chest X-ray images, 10 additional features comprising numeric, binary, categorical, and text data were incorporated to create a unified representation. The model was trained to generate text, predict the degree of patient severity, and identify medical findings. Multi-task learning, especially with text generation prioritisation, improved performance over single-task baselines across language generation metrics. The framework also mitigated overfitting in auxiliary tasks compared to single-task models. Qualitative analysis showed logically coherent narratives and accurate identification of findings, though some repetition and disjointed phrasing remained. This work demonstrates the benefits of multi-modal, multi-task learning for image-to-text generation applications.

2023

pdf bib
Proceedings of the Workshop on Computational Terminology in NLP and Translation Studies (ConTeNTS) Incorporating the 16th Workshop on Building and Using Comparable Corpora (BUCC)
Amal Haddad Haddad | Ayla Rigouts Terryn | Ruslan Mitkov | Reinhard Rapp | Pierre Zweigenbaum | Serge Sharoff
Proceedings of the Workshop on Computational Terminology in NLP and Translation Studies (ConTeNTS) Incorporating the 16th Workshop on Building and Using Comparable Corpora (BUCC)

pdf bib
BERT Goes Off-Topic: Investigating the Domain Transfer Challenge using Genre Classification
Dmitri Roussinov | Serge Sharoff
Findings of the Association for Computational Linguistics: EMNLP 2023

While performance on many text classification tasks has been recently improved due to Pretrained Language Models (PLMs), in this paper we show that they still suffer from a performance gap when the underlying distribution of topics changes. For example, a genre classifier trained on political topics often fails when tested on documents in the same genre, but about sport or medicine. In this work, we quantify this phenomenon empirically with a large corpus and a large set of topics. Thus, we verify that domain transfer remains challenging both for classic PLMs, such as BERT, and for modern large models (LLMs), such as GPT. We develop a data augmentation approach by generating texts in any desired genre and on any desired topic, even when there are no documents in the training corpus that are both in that particular genre and on that particular topic. When we augment the training dataset with the topically-controlled synthetic texts, F1 improves by up to 50% for some topics, approaching on-topic training, while showing no or next to no improvement for other topics. While our empirical results focus on genre classification, our methodology is applicable to other classification tasks such as gender, authorship, or sentiment classification.

pdf bib
FTD at SemEval-2023 Task 3: News Genre and Propaganda Detection by Comparing Mono- and Multilingual Models with Fine-tuning on Additional Data
Mikhail Lepekhin | Serge Sharoff
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)

We report our participation in the SemEval-2023 shared task on propaganda detection and describe our solutions with pre-trained models and their ensembles. For Subtask 1 (News Genre Categorisation), we report the impact of several settings, such as the choice of the classification models (monolingual or multilingual or their ensembles), the choice of the training sets (base or additional sources), the impact of detection certainty in making a classification decision as well as the impact of other hyper-parameters. In particular, we fine-tune models on additional data for other genre classification tasks, such as FTD. We also try adding texts from genre-homogeneous corpora, such as Panorama and Babylon Bee for satire and Giganews for reporting texts. We also prepare models for Subtasks 2 and 3 by first fine-tuning the corresponding models for Subtask 1. The code needed to reproduce the experiments is available.

2022

pdf bib
Proceedings of the BUCC Workshop within LREC 2022
Reinhard Rapp | Pierre Zweigenbaum | Serge Sharoff
Proceedings of the BUCC Workshop within LREC 2022

pdf bib
Applying Natural Annotation and Curriculum Learning to Named Entity Recognition for Under-Resourced Languages
Valeriy Lobov | Alexandra Ivoylova | Serge Sharoff
Proceedings of the 29th International Conference on Computational Linguistics

Current practices in building new NLP models for low-resourced languages rely either on Machine Translation of training sets from better resourced languages or on cross-lingual transfer from them. Still, we can see a considerable performance gap between the models originally trained within better resourced languages and the models transferred from them. In this study we test the possibility of (1) using natural annotation to build synthetic training sets from resources not initially designed for the target downstream task and (2) employing curriculum learning methods to select the most suitable examples from synthetic training sets. We test this hypothesis across seven Slavic languages and across three curriculum learning strategies on Named Entity Recognition as the downstream task. We also test the possibility of fine-tuning the synthetic resources to reflect linguistic properties, such as the grammatical case and gender, both of which are important for the Slavic languages. We demonstrate the possibility of achieving a mean F1 score of 0.78 across the three basic entity types for Belarusian starting from zero resources, in comparison to the baseline of 0.63 using zero-shot transfer from English. For comparison, the English model trained on the original set achieves a mean F1 score of 0.75. The experimental results are available from https://github.com/ValeraLobov/SlavNER

pdf bib
Multimodal Pipeline for Collection of Misinformation Data from Telegram
Jose Sosa | Serge Sharoff
Proceedings of the Thirteenth Language Resources and Evaluation Conference

The paper presents the outcomes of AI-COVID19, our project aimed at better understanding of misinformation flow about COVID-19 across social media platforms. The specific focus of the study reported in this paper is on collecting data from Telegram groups which are active in promotion of COVID-related misinformation. Our corpus collected so far contains around 28 million words, from almost one million messages. Given that a substantial portion of misinformation flow in social media is spread via multimodal means, such as images and video, we have also developed a mechanism for utilising such channels via producing automatic transcripts for videos and automatic classification for images into such categories as memes, screenshots of posts and other kinds of images. The accuracy of the image classification pipeline is around 87%.

pdf bib
Estimating Confidence of Predictions of Individual Classifiers and Their Ensembles for the Genre Classification Task
Mikhail Lepekhin | Serge Sharoff
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Genre identification is a kind of non-topic text classification. The main difference between this task and topic classification is that genre, unlike topic, usually cannot be expressed just by some keywords and is defined as a functional space. Neural models based on pre-trained transformers, such as BERT or XLM-RoBERTa, demonstrate SOTA results in many NLP tasks, including non-topical classification. However, in many cases, their downstream application to very large corpora, such as those extracted from social media, can lead to unreliable results because of dataset shifts, when some raw texts do not match the profile of the training set. To mitigate this problem, we experiment with individual models as well as with their ensembles. To evaluate the robustness of all models we use a prediction confidence metric, which estimates the reliability of a prediction in the absence of a gold standard label. We can evaluate robustness via the confidence gap between the correctly classified texts and the misclassified ones on a labeled test corpus; higher gaps make it easier to identify whether a text is classified correctly. Our results show that for all of the classifiers tested in this study, there is a confidence gap, but for the ensembles, the gap is wider, meaning that ensembles are more robust than their individual models.
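
As a rough illustration of the confidence gap described in this abstract, the sketch below assumes that confidence is the highest class probability and that an ensemble averages its members' probability vectors. Both choices are assumptions made for illustration and may differ from the definitions used in the paper; the function names and the toy numbers are hypothetical.

```python
from statistics import mean

def confidence(probs):
    """Assumed confidence metric: the highest class probability."""
    return max(probs)

def predict(probs):
    """Index of the most probable class."""
    return max(range(len(probs)), key=probs.__getitem__)

def ensemble(prob_lists):
    """Average per-class probabilities over several models (assumed rule)."""
    n = len(prob_lists)
    return [sum(col) / n for col in zip(*prob_lists)]

def confidence_gap(all_probs, gold):
    """Mean confidence on correct predictions minus mean confidence on errors."""
    correct, wrong = [], []
    for probs, label in zip(all_probs, gold):
        bucket = correct if predict(probs) == label else wrong
        bucket.append(confidence(probs))
    if not correct or not wrong:
        raise ValueError("need at least one correct and one misclassified text")
    return mean(correct) - mean(wrong)

# Hypothetical toy data: three texts, two classes, one model.
model_a = [[0.9, 0.1], [0.6, 0.4], [0.55, 0.45]]
gold = [0, 0, 1]
gap_a = confidence_gap(model_a, gold)
```

Under these assumptions, a wider gap means that a confidence threshold separates reliable predictions from unreliable ones more cleanly on unlabeled text.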

pdf bib
BERTology for Machine Translation: What BERT Knows about Linguistic Difficulties for Translation
Yuqian Dai | Marc de Kamps | Serge Sharoff
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Pre-trained transformer-based models, such as BERT, have shown excellent performance in most natural language processing benchmark tests, but we still lack a good understanding of the linguistic knowledge of BERT in Neural Machine Translation (NMT). Our work uses syntactic probes and Quality Estimation (QE) models to analyze the performance of BERT’s syntactic dependencies and their impact on machine translation quality, exploring what kind of syntactic dependencies are difficult for NMT engines based on BERT. While our probing experiments confirm that pre-trained BERT “knows” about syntactic dependencies, its ability to recognize them often decreases after fine-tuning for NMT tasks. We also detect a relationship between syntactic dependencies in three languages and the quality of their translations, which shows which specific syntactic dependencies are likely to be a significant cause of low-quality translations.

pdf bib
Towards Arabic Sentence Simplification via Classification and Generative Approaches
Nouran Khallaf | Serge Sharoff | Rasha Soliman
Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP)

This paper presents an attempt to build a Modern Standard Arabic (MSA) sentence-level simplification system. We experimented with sentence simplification using two approaches: (i) a classification approach leading to lexical simplification pipelines which use Arabic-BERT, a pre-trained contextualised model, as well as a model of fastText word embeddings; and (ii) a generative approach, a Seq2Seq technique applying the multilingual Text-to-Text Transfer Transformer (mT5). We developed our training corpus by aligning the original and simplified sentences from the internationally acclaimed Arabic novel Saaq al-Bambuu. We evaluate the effectiveness of these methods by comparing the generated simple sentences to the target simple sentences using the BERTScore evaluation metric. The simple sentences produced by the mT5 model achieve P 0.72, R 0.68 and F-1 0.70 via BERTScore, while combining Arabic-BERT and fastText achieves P 0.97, R 0.97 and F-1 0.97. In addition, we report a manual error analysis for these experiments.

2021

pdf bib
Proceedings of the 14th Workshop on Building and Using Comparable Corpora (BUCC 2021)
Reinhard Rapp | Serge Sharoff | Pierre Zweigenbaum
Proceedings of the 14th Workshop on Building and Using Comparable Corpora (BUCC 2021)

pdf bib
Automatic Difficulty Classification of Arabic Sentences
Nouran Khallaf | Serge Sharoff
Proceedings of the Sixth Arabic Natural Language Processing Workshop

In this paper, we present a Modern Standard Arabic (MSA) sentence difficulty classifier, which predicts the difficulty of sentences for language learners using either the CEFR proficiency levels or a binary classification as simple or complex. We compare the use of sentence embeddings of different kinds (fastText, mBERT, XLM-R and Arabic-BERT), as well as traditional language features such as POS tags, dependency trees, readability scores and frequency lists for language learners. Our best results have been achieved using fine-tuned Arabic-BERT. Our 3-way CEFR classification reaches F-1 of 0.80 and 0.75 for Arabic-BERT and XLM-R classification respectively and 0.71 Spearman correlation for regression. Our binary difficulty classifier reaches F-1 0.94, and the sentence-pair semantic similarity classifier reaches F-1 0.98.

2020

pdf bib
Proceedings of the 13th Workshop on Building and Using Comparable Corpora
Reinhard Rapp | Pierre Zweigenbaum | Serge Sharoff
Proceedings of the 13th Workshop on Building and Using Comparable Corpora

pdf bib
Overview of the Fourth BUCC Shared Task: Bilingual Dictionary Induction from Comparable Corpora
Reinhard Rapp | Pierre Zweigenbaum | Serge Sharoff
Proceedings of the 13th Workshop on Building and Using Comparable Corpora

The shared task of the 13th Workshop on Building and Using Comparable Corpora was devoted to the induction of bilingual dictionaries from comparable rather than parallel corpora. In this task, for a number of language pairs involving Chinese, English, French, German, Russian and Spanish, the participants were supposed to determine automatically the target language translations of several thousand source language test words of three frequency ranges. We describe here some background, the task definition, the training and test data sets and the evaluation used for ranking the participating systems. We also summarize the approaches used and present the results of the evaluation. In conclusion, the outcome of the competition is a number of systems which provide surprisingly good solutions to the ambitious problem.

pdf bib
Sentence Level Human Translation Quality Estimation with Attention-based Neural Networks
Yu Yuan | Serge Sharoff
Proceedings of the Twelfth Language Resources and Evaluation Conference

This paper explores the use of Deep Learning methods for automatic estimation of quality of human translations. Automatic estimation can provide useful feedback for translation teaching, examination and quality control. Conventional methods for solving this task rely on manually engineered features and external knowledge. This paper presents an end-to-end neural model without feature engineering, incorporating a cross attention mechanism to detect which parts in sentence pairs are most relevant for assessing quality. Another contribution concerns prediction of fine-grained scores for measuring different aspects of translation quality, such as terminological accuracy or idiomatic writing. Empirical results on a large human annotated dataset show that the neural model outperforms feature-based methods significantly. The dataset and the tools are available.

pdf bib
Know thy Corpus! Robust Methods for Digital Curation of Web corpora
Serge Sharoff
Proceedings of the Twelfth Language Resources and Evaluation Conference

This paper proposes a novel framework for digital curation of Web corpora in order to provide robust estimation of their parameters, such as their composition and the lexicon. In recent years language models pre-trained on large corpora emerged as clear winners in numerous NLP tasks, but no proper analysis of the corpora which led to their success has been conducted. The paper presents a procedure for robust frequency estimation, which helps in establishing the core lexicon for a given corpus, as well as a procedure for estimating the corpus composition via unsupervised topic models and via supervised genre classification of Web pages. The results of the digital curation study applied to several Web-derived corpora demonstrate their considerable differences. First, this concerns different frequency bursts which impact the core lexicon obtained from each corpus. Second, this concerns the kinds of texts they contain. For example, OpenWebText contains considerably more topical news and political argumentation in comparison to ukWac or Wikipedia. The tools and the results of analysis have been released.

pdf bib
Recognizing Semantic Relations by Combining Transformers and Fully Connected Models
Dmitri Roussinov | Serge Sharoff | Nadezhda Puchnina
Proceedings of the Twelfth Language Resources and Evaluation Conference

Automatically recognizing an existing semantic relation (e.g. “is a”, “part of”, “property of”, “opposite of” etc.) between two words (phrases, concepts, etc.) is an important task affecting many NLP applications and has been the subject of extensive experimentation and modeling. Current approaches to automatically telling if a relation exists between two given concepts X and Y can be grouped into two types: 1) those modeling word-paths connecting X and Y in text and 2) those modeling distributional properties of X and Y separately, not necessarily in proximity to each other. Here, we investigate how both types can be improved and combined. We suggest a distributional approach that is based on an attention-based transformer. We have also developed a novel word path model that combines useful properties of a convolutional network with a fully connected language model. While our transformer-based approach works better, both our models significantly outperform the state-of-the-art within their classes of approaches. We also demonstrate that combining the two approaches results in additional gains since they use somewhat different data sources.

2019

pdf bib
Towards Functionally Similar Corpus Resources for Translation
Maria Kunilovskaya | Serge Sharoff
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

The paper describes a computational approach to produce functionally comparable monolingual corpus resources for translation studies and contrastive analysis. We exploit a text-external approach, based on a set of Functional Text Dimensions to model text functions, so that each text can be represented as a vector in a multidimensional space of text functions. These vectors can be used to find reasonably homogeneous subsets of functionally similar texts across different corpora. Our models for predicting text functions are based on recurrent neural networks and traditional feature-based machine learning approaches. In addition to using the categories of the British National Corpus as our test case, we investigated the functional comparability of the English parts from the two parallel corpora: CroCo (English-German) and RusLTC (English-Russian) and applied our models to define functionally similar clusters in them. Our results show that the Functional Text Dimensions provide a useful description for text categories, while allowing a more flexible representation for texts with hybrid functions.

2018

pdf bib
Language adaptation experiments via cross-lingual embeddings for related languages
Serge Sharoff
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
Investigating the Influence of Bilingual MWU on Trainee Translation Quality
Yu Yuan | Serge Sharoff
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
Cross-lingual Terminology Extraction for Translation Quality Estimation
Yu Yuan | Yuze Gao | Yue Zhang | Serge Sharoff
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
A Multilingual Dataset for Evaluating Parallel Sentence Extraction from Comparable Corpora
Pierre Zweigenbaum | Serge Sharoff | Reinhard Rapp
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

pdf bib
Toward Pan-Slavic NLP: Some Experiments with Language Adaptation
Serge Sharoff
Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing

There is great variation in the amount of NLP resources available for Slavonic languages. For example, the Universal Dependency treebank (Nivre et al., 2016) has about 2 MW of training resources for Czech, more than 1 MW for Russian, while only 950 words for Ukrainian and nothing for Belorussian, Bosnian or Macedonian. Similarly, the Autodesk Machine Translation dataset only covers three Slavonic languages (Czech, Polish and Russian). In this talk I will discuss a general approach, which can be called Language Adaptation, similarly to Domain Adaptation. In this approach, a model for a particular language processing task is built by lexical transfer of cognate words and by learning a new feature representation for a lesser-resourced (recipient) language starting from a better-resourced (donor) language. More specifically, I will demonstrate how language adaptation works in such training scenarios as Translation Quality Estimation, Part-of-Speech tagging and Named Entity Recognition.

pdf bib
Proceedings of the 10th Workshop on Building and Using Comparable Corpora
Serge Sharoff | Pierre Zweigenbaum | Reinhard Rapp
Proceedings of the 10th Workshop on Building and Using Comparable Corpora

pdf bib
Overview of the Second BUCC Shared Task: Spotting Parallel Sentences in Comparable Corpora
Pierre Zweigenbaum | Serge Sharoff | Reinhard Rapp
Proceedings of the 10th Workshop on Building and Using Comparable Corpora

This paper presents the BUCC 2017 shared task on parallel sentence extraction from comparable corpora. It recalls the design of the datasets, presents their final construction and statistics and the methods used to evaluate system results. 13 runs were submitted to the shared task by 4 teams, covering three of the four proposed language pairs: French-English (7 runs), German-English (3 runs), and Chinese-English (3 runs). The best F-scores as measured against the gold standard were 0.84 (German-English), 0.80 (French-English), and 0.43 (Chinese-English). Because of the design of the dataset, in which not all gold parallel sentence pairs are known, these are only minimum values. We examined manually a small sample of the false negative sentence pairs for the most precise French-English runs and estimated the number of parallel sentence pairs not yet in the provided gold standard. Adding them to the gold standard leads to revised estimates for the French-English F-scores of at most +1.5pt. This suggests that the BUCC 2017 datasets provide a reasonable approximate evaluation of the parallel sentence spotting task.

2016

pdf bib
MoBiL: A Hybrid Feature Set for Automatic Human Translation Quality Assessment
Yu Yuan | Serge Sharoff | Bogdan Babych
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In this paper we introduce MoBiL, a hybrid Monolingual, Bilingual and Language modelling feature set and feature selection and evaluation framework. The set includes translation quality indicators that can be utilized to automatically predict the quality of human translations in terms of content adequacy and language fluency. We compare MoBiL with the QuEst baseline set by using them in classifiers trained with support vector machine and relevance vector machine learning algorithms on the same data set. We also report an experiment on feature selection to opt for fewer but more informative features from MoBiL. Our experiments show that classifiers trained on our feature set perform consistently better in predicting both adequacy and fluency than the classifiers trained on the baseline feature set. MoBiL also performs well when used with both support vector machine and relevance vector machine algorithms.

pdf bib
Genre classification for a corpus of academic webpages
Erika Dalan | Serge Sharoff
Proceedings of the 10th Web as Corpus Workshop

2015

pdf bib
Book Reviews: Web Corpus Construction by Roland Schäfer and Felix Bildhauer
Serge Sharoff
Computational Linguistics, Volume 41, Issue 1 - March 2015

pdf bib
Proceedings of the Eighth Workshop on Building and Using Comparable Corpora
Pierre Zweigenbaum | Serge Sharoff | Reinhard Rapp
Proceedings of the Eighth Workshop on Building and Using Comparable Corpora

pdf bib
Obtaining SMT dictionaries for related languages
Miguel Rios | Serge Sharoff
Proceedings of the Eighth Workshop on Building and Using Comparable Corpora

pdf bib
BUCC Shared Task: Cross-Language Document Similarity
Serge Sharoff | Pierre Zweigenbaum | Reinhard Rapp
Proceedings of the Eighth Workshop on Building and Using Comparable Corpora

pdf bib
Applying Multi-Dimensional Analysis to a Russian Webcorpus: Searching for Evidence of Genres
Anisya Katinskaya | Serge Sharoff
The 5th Workshop on Balto-Slavic Natural Language Processing

pdf bib
Large Scale Translation Quality Estimation
Miguel Angel Rios Gaona | Serge Sharoff
Proceedings of the 1st Deep Machine Translation Workshop

2014

pdf bib
Designing and Evaluating a Reliable Corpus of Web Genres via Crowd-Sourcing
Noushin Rezapour Asheghi | Serge Sharoff | Katja Markert
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Research in Natural Language Processing often relies on a large collection of manually annotated documents. However, currently there is no reliable genre-annotated corpus of web pages to be employed in Automatic Genre Identification (AGI). In AGI, documents are classified based on their genres rather than their topics or subjects. The major shortcoming of available web genre collections is their relatively low inter-coder agreement. Reliability of annotated data is an essential factor for reliability of the research result. In this paper, we present the first web genre corpus which is reliably annotated. We developed precise and consistent annotation guidelines which consist of well-defined and well-recognized categories. For annotating the corpus, we used crowd-sourcing which is a novel approach in genre annotation. We computed the overall as well as the individual categories’ chance-corrected inter-annotator agreement. The results show that the corpus has been annotated reliably.

pdf bib
Extracting Multiword Translations from Aligned Comparable Documents
Reinhard Rapp | Serge Sharoff
Proceedings of the 3rd Workshop on Hybrid Approaches to Machine Translation (HyTra)

pdf bib
Semi-supervised Graph-based Genre Classification for Web Pages
Noushin Rezapour Asheghi | Katja Markert | Serge Sharoff
Proceedings of TextGraphs-9: the workshop on Graph-based Methods for Natural Language Processing

pdf bib
Evaluating Term Extraction Methods for Interpreters
Ran Xu | Serge Sharoff
Proceedings of the 4th International Workshop on Computational Terminology (Computerm)

pdf bib
Multiple views as aid to linguistic annotation error analysis
Marilena Di Bari | Serge Sharoff | Martin Thomas
Proceedings of LAW VIII - The 8th Linguistic Annotation Workshop

2013

pdf bib
English-to-Russian MT evaluation campaign
Pavel Braslavski | Alexander Beloborodov | Maxim Khalilov | Serge Sharoff
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
Proceedings of the Sixth Workshop on Building and Using Comparable Corpora
Serge Sharoff | Pierre Zweigenbaum | Reinhard Rapp
Proceedings of the Sixth Workshop on Building and Using Comparable Corpora

2012

pdf bib
Beyond translation memories: finding similar documents in comparable corpora
Serge Sharoff
Proceedings of Translating and the Computer 34

pdf bib
Identifying Word Translations from Comparable Documents Without a Seed Lexicon
Reinhard Rapp | Serge Sharoff | Bogdan Babych
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

The extraction of dictionaries from parallel text corpora is an established technique. However, as parallel corpora are a scarce resource, in recent years the extraction of dictionaries using comparable corpora has obtained increasing attention. In order to find a mapping between languages, almost all approaches suggested in the literature rely on a seed lexicon. The work described here achieves competitive results without requiring such a seed lexicon. Instead it presupposes mappings between comparable documents in different languages. For some common types of textual resources (e.g. encyclopedias or newspaper texts) such mappings are either readily available or can be established relatively easily. The current work is based on Wikipedias where the mappings between languages are determined by the authors of the articles. We describe a neural-network inspired algorithm which first characterizes each Wikipedia article by a number of keywords, and then considers the identification of word translations as a variant of word alignment in a noisy environment. We present results and evaluations for eight language pairs involving Germanic, Romanic, and Slavic languages as well as Chinese.

pdf bib
Design of a hybrid high quality machine translation system
Bogdan Babych | Kurt Eberle | Johanna Geiß | Mireia Ginestí-Rosell | Anthony Hartley | Reinhard Rapp | Serge Sharoff | Martin Thomas
Proceedings of the Joint Workshop on Exploiting Synergies between Information Retrieval and Machine Translation (ESIRMT) and Hybrid Approaches to Machine Translation (HyTra)

2011

pdf bib
Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web
Pierre Zweigenbaum | Reinhard Rapp | Serge Sharoff
Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web

pdf bib
Cross Language POS Taggers (and other Tools) for Indian Languages: An Experiment with Kannada using Telugu Resources
Siva Reddy | Serge Sharoff
Proceedings of the Fifth International Workshop On Cross Lingual Information Access

2010

pdf bib
The Web Library of Babel: evaluating genre collections
Serge Sharoff | Zhili Wu | Katja Markert
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

We present experiments in automatic genre classification on web corpora, comparing a wide variety of features on several different genre-annotated datasets (HGC, I-EN, KI-04, KRYS-I, MGC and SANTINIS). We investigate the performance of several types of features (POS n-grams, character n-grams and word n-grams) and show that simple character n-grams perform best on current collections because of their ability to generalise both lexical and syntactic phenomena related to genres. However, we also show that these impressive results might not be transferable to the wider web due to the lack of comparability between different annotation labels (many webpages cannot be described in terms of the genre labels in individual collections), lack of representativeness of existing collections (many genres are represented by webpages coming from a small number of sources) as well as problems in the reliability of genre annotation (many pages from the web are difficult to interpret in terms of the labels available). This suggests that more research is needed to understand genres on the Web.

pdf bib
Fine-Grained Genre Classification Using Structural Learning Algorithms
Zhili Wu | Katja Markert | Serge Sharoff
Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics

pdf bib
Advanced Corpus Solutions for Humanities Researchers
James Wilson | Anthony Hartley | Serge Sharoff | Paul Stephenson
Proceedings of the 24th Pacific Asia Conference on Language, Information and Computation

2009

pdf bib
Evaluation-Guided Pre-Editing of Source Text: Improving MT-Tractability of Light Verb Constructions
Bogdan Babych | Anthony Hartley | Serge Sharoff
Proceedings of the 13th Annual Conference of the European Association for Machine Translation

2008

pdf bib
Generalising Lexical Translation Strategies for MT Using Comparable Corpora
Bogdan Babych | Serge Sharoff | Anthony Hartley
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

We report on an on-going research project aimed at increasing the range of translation equivalents which can be automatically discovered by MT systems. The methodology is based on semi-supervised learning of indirect translation strategies from large comparable corpora and applying them in run-time to generate novel, previously unseen translation equivalents. This approach is different from methods based on parallel resources, which currently can reuse only individual translation equivalents. Instead it models translation strategies which generalise individual equivalents and can successfully generate an open class of new translation solutions. The task of the project is integration of the developed technology into open-source MT systems.

pdf bib
Cleaneval: a Competition for Cleaning Web Pages
Marco Baroni | Francis Chantree | Adam Kilgarriff | Serge Sharoff
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Cleaneval is a shared task and competitive evaluation on the topic of cleaning arbitrary web pages, with the goal of preparing web data for use as a corpus for linguistic and language technology research and development. The first exercise took place in 2007. We describe how it was set up, results, and lessons learnt.

pdf bib
Designing and Evaluating a Russian Tagset
Serge Sharoff | Mikhail Kopotev | Tomaž Erjavec | Anna Feldman | Dagmar Divjak
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

This paper reports the principles behind designing a tagset to cover Russian morphosyntactic phenomena, modifications of the core tagset, and its evaluation. The tagset is based on the MULTEXT-East framework, while the decisions in designing it were aimed at achieving a balance between parameters important for linguists and the possibility to detect and disambiguate them automatically. The final tagset contains about 500 tags and achieves about 95% accuracy on the disambiguated portion of the Russian National Corpus. We have also produced a test set that can be shared with other researchers.

pdf bib
Corpus-Based Tools for Computer-Assisted Acquisition of Reading Abilities in Cognate Languages
Svitlana Kurella | Serge Sharoff | Anthony Hartley
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

This paper presents an approach to computer-assisted teaching of reading abilities using corpus data. The approach is supported by a set of tools for automatically selecting and classifying texts retrieved from the Internet. The approach is based on a linguistic model of textual cohesion which describes relations between larger textual units that go beyond the sentence level. We show that textual connectors that link such textual units reliably predict different types of texts, such as “information” and “opinion”: using only textual connectors as features, an SVM classifier achieves an F-score of between 0.85 and 0.93 for predicting these classes. The tools are used in our project on teaching reading skills in a foreign language (L3) that is cognate to a known foreign language (L2).

2007

pdf bib
Translating from under-resourced languages: comparing direct transfer against pivot translation
Bogdan Babych | Anthony Hartley | Serge Sharoff
Proceedings of Machine Translation Summit XI: Papers

pdf bib
A dynamic dictionary for discovering indirect translation equivalents
Bogdan Babych | Anthony Hartley | Serge Sharoff
Proceedings of Translating and the Computer 29

pdf bib
Assisting Translators in Indirect Lexical Transfer
Bogdan Babych | Anthony Hartley | Serge Sharoff | Olga Mudraya
Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics

2006

pdf bib
ASSIST: Automated Semantic Assistance for Translators
Serge Sharoff | Bogdan Babych | Paul Rayson | Olga Mudraya | Scott Piao
Demonstrations

pdf bib
A Uniform Interface to Large-Scale Linguistic Resources
Serge Sharoff
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

In the paper we address two practical problems concerning the use of corpora in translation studies. The first stems from the limited resources available for targeted languages and genres within languages, whereas translation researchers and students need sufficiently large modern corpora, either reflecting general language or specific to a problem domain. The second problem concerns the lack of a uniform interface for accessing the resources, even when they exist. We deal with the first problem by developing a framework for semi-automatic acquisition of large corpora from the Internet for the languages relevant for our research and training needs. We outline the methodology used and discuss the composition of Internet-derived corpora. We deal with the second problem by developing a uniform interface to our corpora. In addition to standard options for choosing corpora and sorting concordance lines, the interface can compute the list of collocations and filter the results according to user-specified patterns in order to detect language-specific syntactic structures.

pdf bib
Using collocations from comparable corpora to find translation equivalents
Serge Sharoff | Bogdan Babych | Anthony Hartley
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

In this paper we present a tool for finding appropriate translation equivalents for words from the general lexicon using comparable corpora. For a phrase in the source language the tool suggests a range of possible expressions used in similar contexts in target language corpora. In the paper we discuss the method and present results of human evaluation of the performance of the tool.

pdf bib
Using Richly Annotated Trilingual Language Resources for Acquiring Reading Skills in a Foreign Language
Dragoş Ciobanu | Tony Hartley | Serge Sharoff
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

In an age when demand for innovative and motivating language teaching methodologies is at a very high level, TREAT - the Trilingual REAding Tutor - combines the most advanced natural language processing (NLP) techniques with the latest second and third language acquisition (SLA/TLA) research in an intuitive and user-friendly environment that has been proven to help adult learners (native speakers of L1) acquire reading skills in an unknown L3 which is related to (cognate with) an L2 they know to some extent. This corpus-based methodology relies on existing linguistic resources, as well as materials that are easy to assemble, and can be adapted to support other pairs of L2-L3 related languages, as well. A small evaluation study conducted at the Leeds University Centre for Translation Studies indicates that, when using TREAT, learners feel more motivated to study an unknown L3, acquire significant linguistic knowledge of both the L3 and L2 rapidly, and increase their performance when translating from L3 into L1.

pdf bib
Using Comparable Corpora to Solve Problems Difficult for Human Translators
Serge Sharoff | Bogdan Babych | Anthony Hartley
Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions

2004

pdf bib
Towards Basic Categories for Describing Properties of Texts in a Corpus
Serge Sharoff
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

pdf bib
What is at Stake: a Case Study of Russian Expressions Starting with a Preposition
Serge Sharoff
Proceedings of the Workshop on Multiword Expressions: Integrating Processing

2002

pdf bib
Meaning as use: exploitation of aligned corpora for the contrastive study of lexical semantics
Serge Sharoff
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)

2000

pdf bib
Multilinguality in a Text Generation System For Three Slavic Languages
Geert-Jan Kruijff | Elke Teich | John Bateman | Ivana Kruijff-Korbayova | Hana Skoumalova | Serge Sharoff | Lena Sokolova | Tony Hartley | Kamenka Staykova | Jiri Hana
COLING 2000 Volume 1: The 18th International Conference on Computational Linguistics

pdf bib
Resources for Multilingual Text Generation in Three Slavic Languages
John Bateman | Elke Teich | Geert-Jan Kruijff | Ivana Kruijff-Korbayová | Serge Sharoff | Hana Skoumalová
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)