Heshaam Faili


2024

EPOQUE: An English-Persian Quality Estimation Dataset
Mohammed Hossein Jafari Harandi | Fatemeh Azadi | Mohammad Javad Dousti | Heshaam Faili
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Translation quality estimation (QE) is an important component of real-world machine translation applications. Unfortunately, human-labeled QE datasets, which play an important role in developing and assessing QE models, are available for only a limited number of language pairs. In this paper, we present the first English-Persian QE dataset, called EPOQUE, which has manually annotated direct assessment labels. EPOQUE contains 1000 sentences translated from English to Persian and annotated by three human annotators. It is publicly available, and thus can be used as a zero-shot test set or for other scenarios in future work. We also evaluate and report the performance of two state-of-the-art QE models, i.e., Transquest and CometKiwi, as baselines on our dataset. Furthermore, our experiments show that using a small subset of the proposed dataset containing 300 sentences to fine-tune Transquest can improve its performance by more than 8% in terms of the Pearson correlation with a held-out test set.
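As a rough illustration of the evaluation metric named in this abstract, the sketch below computes the Pearson correlation between model scores and human direct assessment labels. It is a minimal example, not the authors' evaluation script: the function names, the toy scores, and the choice to average the three annotators' labels per sentence are all assumptions made here for illustration.

```python
from math import sqrt
from statistics import mean


def average_annotations(rows):
    """Collapse per-sentence annotator scores into one label per sentence.

    Averaging is an assumption; the abstract does not say how the three
    annotators' labels are combined.
    """
    return [mean(row) for row in rows]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical toy data: four sentences, three annotators each.
human = average_annotations([
    [80.0, 75.0, 85.0],
    [40.0, 50.0, 45.0],
    [95.0, 90.0, 100.0],
    [20.0, 30.0, 25.0],
])
model_scores = [0.7, 0.4, 0.9, 0.3]  # hypothetical QE model outputs
print(round(pearson(model_scores, human), 3))
```

Because Pearson correlation is invariant to linear rescaling, the model scores and the human labels do not need to share a scale.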

Esposito: An English-Persian Scientific Parallel Corpus for Machine Translation
Mersad Esalati | Mohammad Javad Dousti | Heshaam Faili
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Neural machine translation requires a large number of parallel sentences, along with in-domain parallel data, to attain the best results. Nevertheless, no scientific parallel corpus is available for the English-Persian language pair. In this paper, a parallel corpus called Esposito is introduced, which contains 3.5 million parallel sentences in the scientific domain for the English-Persian language pair. In addition, we present a manually validated scientific test set that might serve as a baseline for future studies. We show that a system trained using Esposito along with other publicly available data improves the baseline on average by 7.6 and 8.4 BLEU points for the En->Fa and Fa->En directions, respectively. Additionally, domain analysis using a 5-gram KenLM model revealed notable distinctions between our parallel corpus and the existing generic parallel corpus. This dataset will be available to the public upon the acceptance of the paper.

2023

PMI-Align: Word Alignment With Point-Wise Mutual Information Without Requiring Parallel Training Data
Fatemeh Azadi | Heshaam Faili | Mohammad Javad Dousti
Findings of the Association for Computational Linguistics: ACL 2023

Word alignment has many applications, including cross-lingual annotation projection, bilingual lexicon extraction, and the evaluation or analysis of translation outputs. Recent studies show that using contextualized embeddings from pre-trained multilingual language models could give us high-quality word alignments without the need for parallel training data. In this work, we propose PMI-Align, which computes and uses the point-wise mutual information between source and target tokens to extract word alignments, instead of the cosine similarity or dot product mostly used in recent approaches. Our experiments show that our proposed PMI-Align approach could outperform the rival methods on five out of six language pairs. Although our approach requires no parallel training data, we show that this method could also benefit approaches that use parallel data to fine-tune pre-trained language models on word alignments. Our code and data are publicly available.
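The idea of scoring token pairs with point-wise mutual information instead of raw similarity can be sketched as follows. This is an illustrative approximation, not the PMI-Align implementation: the way the joint distribution is built here (a softmax over the whole similarity matrix), the bidirectional argmax used to extract alignments, and all function names are assumptions. The input similarity matrix is a hypothetical stand-in for similarities between contextualized embeddings.

```python
from math import exp, log


def pmi_matrix(sim):
    """Turn a source-by-target similarity matrix into a PMI matrix.

    Assumption: the joint probability p(x, y) is a softmax over all cells,
    and the marginals p(x), p(y) are its row and column sums.
    """
    weights = [[exp(s) for s in row] for row in sim]
    total = sum(sum(row) for row in weights)
    joint = [[w / total for w in row] for row in weights]
    p_src = [sum(row) for row in joint]
    p_tgt = [sum(col) for col in zip(*joint)]
    return [
        [log(joint[i][j] / (p_src[i] * p_tgt[j])) for j in range(len(p_tgt))]
        for i in range(len(p_src))
    ]


def align(pmi):
    """Keep pairs that are the best match in both directions (an assumption)."""
    forward = {
        (i, max(range(len(row)), key=row.__getitem__))
        for i, row in enumerate(pmi)
    }
    backward = {
        (max(range(len(col)), key=col.__getitem__), j)
        for j, col in enumerate(zip(*pmi))
    }
    return sorted(forward & backward)


# Hypothetical similarities for a 2-token source and 2-token target sentence.
similarities = [
    [2.0, 0.1],
    [0.2, 1.5],
]
print(align(pmi_matrix(similarities)))
```

PMI normalizes each pair's joint probability by the two marginals, so a token that is moderately similar to everything is not favored over a token with one distinctive match.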

2022

PerCQA: Persian Community Question Answering Dataset
Naghme Jamali | Yadollah Yaghoobzadeh | Heshaam Faili
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Community Question Answering (CQA) forums provide answers to many real-life questions. These forums are popular among machine learning researchers due to their large size. Automatic answer selection, answer ranking, question retrieval, expert finding, and fact-checking are examples of learning tasks performed using CQA data. This paper presents PerCQA, the first Persian dataset for CQA. This dataset contains the questions and answers crawled from the most well-known Persian forum. After data acquisition, we develop rigorous annotation guidelines through an iterative process and then annotate the question-answer pairs in the SemEvalCQA format. PerCQA contains 989 questions and 21,915 annotated answers. We make PerCQA publicly available to encourage more research in Persian CQA. We also build strong benchmarks for the task of answer selection in PerCQA by using mono- and multi-lingual pre-trained language models.

2021

NSURL-2021 Shared Task 1: Semantic Relation Extraction in Persian
Nasrin Taghizadeh | Ali Ebrahimi | Heshaam Faili
Proceedings of the Second International Workshop on NLP Solutions for Under Resourced Languages (NSURL 2021) co-located with ICNLSP 2021

PerSpellData: An Exhaustive Parallel Spell Dataset For Persian
Romina Oji | Nasrin Taghizadeh | Heshaam Faili
Proceedings of the Second International Workshop on NLP Solutions for Under Resourced Languages (NSURL 2021) co-located with ICNLSP 2021

NLP-IIS@UT at SemEval-2021 Task 4: Machine Reading Comprehension using the Long Document Transformer
Hossein Basafa | Sajad Movahedi | Ali Ebrahimi | Azadeh Shakery | Heshaam Faili
Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)

This paper presents a technical report of our submission to Task 4 of SemEval-2021, titled Reading Comprehension of Abstract Meaning. In this task, we want to predict the correct answer to a question given a context. Contexts are usually very lengthy and require a large receptive field from the model. Thus, common contextualized language models like BERT lose fine-grained representation and performance because of their limited input token capacity. To tackle this problem, we used the Longformer model to better process the sequences. Furthermore, we utilized the method proposed in the Longformer benchmark on the WikiHop dataset, which improved the accuracy on our task data from 23.01% and 22.95%, achieved by the baselines for subtasks 1 and 2, respectively, to 70.30% and 64.38%.

2016

Improving Word Alignment of Rare Words with Word Embeddings
Masoud Jalili Sabet | Heshaam Faili | Gholamreza Haffari
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

We address the problem of inducing word alignment for language pairs by developing an unsupervised model that can be applied to other generative alignment models. We approach the task by: i) proposing a new alignment model based on IBM alignment model 1 that uses vector representations of words, and ii) examining the use of similar source words to overcome the problem of rare source words and improve the alignments. We apply our method to English-French corpora and run the experiments with different numbers of sentence pairs. Our results show competitive performance against the baseline and in some cases improve the results by up to 6.9% in terms of precision.
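For readers unfamiliar with the baseline this abstract builds on, the following is a minimal sketch of standard IBM Model 1 trained with expectation-maximization. It shows only the plain model, without a NULL word and without the word-embedding extension the paper proposes; the function name `train_ibm1` and the three-sentence English-French toy corpus are hypothetical.

```python
from collections import defaultdict


def train_ibm1(pairs, iterations=10):
    """Estimate translation probabilities t(f | e) with IBM Model 1 EM.

    pairs: list of (source_tokens, target_tokens) sentence pairs.
    Returns a dict mapping (target_word, source_word) to a probability.
    """
    src_vocab = {e for es, _ in pairs for e in es}
    tgt_vocab = {f for _, fs in pairs for f in fs}
    # Uniform initialization over the target vocabulary.
    t = {(f, e): 1.0 / len(tgt_vocab) for f in tgt_vocab for e in src_vocab}

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (f, e) links
        total = defaultdict(float)  # expected count of links from e
        # E-step: distribute each target word over the source words
        # of its sentence in proportion to the current t(f | e).
        for es, fs in pairs:
            for f in fs:
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac
        # M-step: renormalize expected counts per source word.
        t = {(f, e): count[(f, e)] / total[e] for (f, e) in t}
    return t


# Hypothetical toy corpus (English source, French target).
corpus = [
    (["the", "house"], ["la", "maison"]),
    (["the", "flower"], ["la", "fleur"]),
    (["a", "flower"], ["une", "fleur"]),
]
table = train_ibm1(corpus, iterations=20)
best = max(["la", "maison", "fleur", "une"], key=lambda f: table[(f, "flower")])
print(best, round(table[("fleur", "flower")], 3))
```

A word such as "house", seen in only one sentence pair, converges more slowly than the frequent words around it. That is the rare-word weakness the abstract targets.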

2015

On the Importance of Ezafe Construction in Persian Parsing
Alireza Nourian | Mohammad Sadegh Rasooli | Mohsen Imany | Heshaam Faili
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

2014

A Probabilistic Approach to Persian Ezafe Recognition
Habibollah Asghari | Jalal Maleki | Heshaam Faili
Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, volume 2: Short Papers

2013

Supervised Morphology Generation Using Parallel Corpus
Alireza Mahmoudi | Mohsen Arabsorkhi | Heshaam Faili
Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013

Discourse-aware Statistical Machine Translation as a Context-sensitive Spell Checker
Behzad Mirzababaei | Heshaam Faili | Nava Ehsan
Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013

Automatic Enhancement of LTAG Treebank
Farzaneh Zarei | Ali Basirat | Heshaam Faili | Maryam Sadat Mirian
Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013

2012

Collocation Extraction using Parallel Corpus
Kavosh Asadi Atui | Heshaam Faili | Kaveh Assadi Atuie
Proceedings of COLING 2012: Posters

Fast Unsupervised Dependency Parsing with Arc-Standard Transitions
Mohammad Sadegh Rasooli | Heshaam Faili
Proceedings of the Joint Workshop on Unsupervised and Semi-Supervised Learning in NLP

2011

Constructing Linguistically Motivated Structures from Statistical Grammars
Ali Basirat | Heshaam Faili
Proceedings of the International Conference Recent Advances in Natural Language Processing 2011

Unsupervised Learning for Persian WordNet Construction
Mortaza Montazery | Heshaam Faili
Proceedings of the International Conference Recent Advances in Natural Language Processing 2011

2010

Automatic Persian WordNet Construction
Mortaza Montazery | Heshaam Faili
Coling 2010: Posters

2009

From Partial toward Full Parsing
Heshaam Faili
Proceedings of the International Conference RANLP-2009