Sadaf Abdul Rauf

Also published as: Sadaf Abdul-Rauf


2021

Automatic Sentence Simplification in Low Resource Settings for Urdu
Yusra Anees | Sadaf Abdul Rauf
Proceedings of the 1st Workshop on NLP for Positive Impact

Building automated simplification systems starts with corpora of complex sentences paired with their simplified versions, which are needed to understand sentence complexity and to develop automatic text simplification systems. We present a lexically and syntactically simplified Urdu simplification corpus with a detailed analysis of the various simplification operations and a human evaluation of corpus quality. We further analyze our corpora using text readability measures and compare the original, lexically simplified, and syntactically simplified corpora. In addition, we compare our corpus with other existing simplification corpora by building simplification systems and evaluating them using BLEU and SARI scores. Our system achieves the highest BLEU score and a comparable SARI score relative to the other systems. We release our simplification corpora for the benefit of the research community.

2020

Simplification automatique de texte dans un contexte de faibles ressources (Automatic Text Simplification : Approaching the Problem in Low Resource Settings for French)
Sadaf Abdul Rauf | Anne-Laure Ligozat | Francois Yvon | Gabriel Illouz | Thierry Hamon
Actes de la 6e conférence conjointe Journées d'Études sur la Parole (JEP, 33e édition), Traitement Automatique des Langues Naturelles (TALN, 27e édition), Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL, 22e édition). Volume 2 : Traitement Automatique des Langues Naturelles

Text simplification has emerged as an active subfield of natural language processing, owing to the practical and theoretical problems it makes it possible to address and to its many practical applications. Simplification corpora are needed to train automatic simplification systems; however, these resources are scarce and exist for only a small number of languages. We show here that, in a setting where simplification resources are scarce, it is nevertheless possible to build simplification systems by relying on synthetic corpora, for example obtained through machine translation, and we evaluate various ways of constructing them.

Document Level NMT of Low-Resource Languages with Backtranslation
Sami Ul Haq | Sadaf Abdul Rauf | Arsalan Shaukat | Abdullah Saeed
Proceedings of the Fifth Conference on Machine Translation

This paper describes our system submission to the WMT20 shared task on similar language translation. We examined the use of document-level neural machine translation (NMT) systems for the low-resource, similar language pair Marathi-Hindi. Our system extends the state-of-the-art Transformer architecture with hierarchical attention networks to incorporate contextual information. Since NMT requires a large amount of parallel data, which is not available for this task, our approach focuses on utilizing monolingual data with back-translation to train our models. Our experiments reveal that document-level NMT can be a reasonable alternative to sentence-level NMT for improving the translation quality of low-resource languages, even when used with synthetic data.

LIMSI @ WMT 2020
Sadaf Abdul Rauf | José Carlos Rosales Núñez | Minh Quang Pham | François Yvon
Proceedings of the Fifth Conference on Machine Translation

This paper describes LIMSI’s submissions to the translation shared tasks at WMT’20. This year we have focused our efforts on the biomedical translation task, developing a resource-heavy system for the translation of medical abstracts from English into French, using back-translated texts, terminological resources, and multiple pre-processing pipelines, including pre-trained representations. Systems were also prepared for the robustness task for translating from English into German; for this large-scale task we developed multi-domain, noise-robust translation systems that aim to handle the two test conditions: zero-shot and few-shot domain adaptation.

FJWU participation for the WMT20 Biomedical Translation Task
Sumbal Naz | Sadaf Abdul Rauf | Noor-e- Hira | Sami Ul Haq
Proceedings of the Fifth Conference on Machine Translation

This paper describes the systems submitted by the FJWU-NRPU team to the WMT20 Biomedical shared translation task. We focused our submission on exploring the effects of adding in-domain corpora extracted from various out-of-domain sources. Systems were built for French to English using in-domain corpora through fine-tuning and selective data training. We further explored BERT-based models, focusing specifically on the effect of domain-adaptive subword units.

Developing a Monolingual Sentence Simplification Corpus for Urdu
Yusra Anees | Sadaf Abdul Rauf | Nauman Iqbal | Abdul Basit Siddiqi
Proceedings of the Fourth Widening Natural Language Processing Workshop

Complex sentences are a hurdle in the learning process of language learners. Sentence simplification aims to convert a complex sentence into a simpler form that is easily comprehensible. Building such automated simplification systems starts with corpora of complex sentences paired with their simplified versions, which are needed to understand sentence complexity and to develop automatic text simplification systems. No such corpus has yet been developed for Urdu, and we fill this gap by developing one to help start research on readability and automatic sentence simplification. We present a lexically and syntactically simplified Urdu simplification corpus and a detailed analysis of the various simplification operations. We further analyze our corpora using text readability measures and compare the original, lexically simplified, and syntactically simplified corpora.

On the Exploration of English to Urdu Machine Translation
Sadaf Abdul Rauf | Syeda Abida | Noor-e- Hira | Syeda Zahra | Dania Parvez | Javeria Bashir | Qurat-ul-ain Majid
Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)

Machine translation is an essential technology for reducing communication barriers in today’s world. It has made substantial progress in recent years and is widely used in commercial as well as non-profit sectors. However, this is the case only for European and other high-resource languages. For the English-Urdu language pair, the technology is still in its infancy due to the scarcity of resources. The present research is an important milestone in English-Urdu machine translation, as we present results for four major domains (Biomedical, Religious, Technological and General) using Statistical and Neural Machine Translation. We performed a series of experiments to optimize the performance of each system and to study the impact of data sources on the systems. Finally, we compared the data sources and examined the effect of language model size on statistical machine translation performance.

Improving Document-Level Neural Machine Translation with Domain Adaptation
Sami Ul Haq | Sadaf Abdul Rauf | Arslan Shoukat | Noor-e- Hira
Proceedings of the Fourth Workshop on Neural Generation and Translation

Recent studies have shown that the translation quality of NMT systems can be improved by providing document-level contextual information. In general, sentence-based NMT models are extended to capture contextual information from large-scale document-level corpora, which are difficult to acquire. Domain adaptation, on the other hand, promises to adapt components of already developed systems by exploiting limited in-domain data. This paper presents FJWU’s system submission at WNGT; we participated specifically in the document-level MT task for German-English translation. Our system is based on a context-aware Transformer model developed on top of the original NMT architecture by integrating contextual information using attention networks. Our experimental results show that providing previous sentences as context significantly improves the BLEU score compared to a strong NMT baseline. We also studied the impact of domain adaptation on document-level translation and were able to improve results by adapting the systems to the testing domain.

2019

Exploring Transfer Learning and Domain Data Selection for the Biomedical Translation
Noor-e- Hira | Sadaf Abdul Rauf | Kiran Kiani | Ammara Zafar | Raheel Nawaz
Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2)

Transfer learning and selective data training are two of the many approaches being extensively investigated to improve the quality of Neural Machine Translation systems. This paper presents a series of experiments applying transfer learning and selective data training for our participation in the biomedical shared task of WMT19. We used information retrieval to select related sentences from out-of-domain data and used them as additional training data through transfer learning. We also report the effect of tokenization on translation model performance.

2011

Investigations on Translation Model Adaptation Using Monolingual Data
Patrik Lambert | Holger Schwenk | Christophe Servan | Sadaf Abdul-Rauf
Proceedings of the Sixth Workshop on Statistical Machine Translation

LIUM’s SMT Machine Translation Systems for WMT 2011
Holger Schwenk | Patrik Lambert | Loïc Barrault | Christophe Servan | Sadaf Abdul-Rauf | Haithem Afli | Kashif Shah
Proceedings of the Sixth Workshop on Statistical Machine Translation

2010

LIUM SMT Machine Translation System for WMT 2010
Patrik Lambert | Sadaf Abdul-Rauf | Holger Schwenk
Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR

2009

On the Use of Comparable Corpora to Improve SMT performance
Sadaf Abdul-Rauf | Holger Schwenk
Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009)

SMT and SPE Machine Translation Systems for WMT’09
Holger Schwenk | Sadaf Abdul-Rauf | Loïc Barrault | Jean Senellart
Proceedings of the Fourth Workshop on Statistical Machine Translation

Exploiting Comparable Corpora with TER and TERp
Sadaf Abdul-Rauf | Holger Schwenk
Proceedings of the 2nd Workshop on Building and Using Comparable Corpora: from Parallel to Non-parallel Corpora (BUCC)

2008

The LIUM Arabic/English statistical machine translation system for IWSLT 2008.
Holger Schwenk | Yannick Estève | Sadaf Abdul Rauf
Proceedings of the 5th International Workshop on Spoken Language Translation: Evaluation Campaign

This paper describes the system developed by the LIUM laboratory for the 2008 IWSLT evaluation. We participated only in the Arabic/English BTEC task. We developed a statistical phrase-based system using the Moses toolkit, and used SYSTRAN’s rule-based translation system to perform a morphological decomposition of the Arabic words. A continuous space language model was deployed to improve the modeling of the target language. Both approaches achieved significant improvements in the BLEU score. The system achieves a score of 49.4 on the test set of the 2008 IWSLT evaluation.