The use of Large Language Models (LLMs) has grown rapidly, and LLMs are now used to generate text in many languages and for many tasks. Moreover, with major companies such as Google and OpenAI offering them, LLMs have become widely accessible and easy to use. An important open issue, however, is how to distinguish AI-generated texts from human-written ones. In this article, we investigate the problem of AI-generated text detection from two different aspects: semantics and syntax. We present an AI model that distinguishes AI-generated texts from human-written ones with high accuracy on both multilingual and monolingual tasks using the M4 dataset. According to our results, the semantic approach is more helpful for detection, while the syntactic approach leaves considerable room for improvement and is a promising direction for future work.
The identification of Verbal Multiword Expressions (VMWEs) presents a greater challenge compared to non-verbal MWEs due to their higher surface variability. VMWEs are linguistic units that exhibit varying levels of semantic opaqueness and pose difficulties for computational models in terms of both their identification and the degree of compositionality. In this study, a new approach to predicting the compositional nature of VMWEs in Persian is presented. The method begins with an automatic identification of VMWEs in Persian sentences, which is approached as a sequence labeling problem for recognizing the components of VMWEs. The method then creates word embeddings that better capture the semantic properties of VMWEs and uses them to determine the degree of compositionality through multiple criteria. The study compares two neural architectures for identification, BiLSTM and ParsBERT, and shows that a fine-tuned BERT model surpasses the BiLSTM model in evaluation metrics with an F1 score of 89%. Next, a word2vec embedding model is trained to capture the semantics of identified VMWEs and is used to estimate their compositionality, resulting in an accuracy of 70.9% as demonstrated by experiments on a collected dataset of expert-annotated compositional and non-compositional VMWEs.
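The compositionality criterion described above can be sketched as comparing the embedding of the whole expression (treated as a single token) with the composed embeddings of its parts. The following is a minimal illustration with hand-made toy vectors; the real system uses trained word2vec embeddings, and the 0.5 decision threshold here is an assumption, not the paper's tuned value.

```python
import math

# Toy vectors standing in for trained word2vec embeddings; the Persian
# VMWE "harf zadan" ("to talk") is tokenized as a single unit.
vectors = {
    "harf_zadan": [0.9, 0.1, 0.2],  # the expression as one token
    "harf":       [0.2, 0.8, 0.5],  # component 1
    "zadan":      [0.1, 0.3, 0.9],  # component 2
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def compositionality(mwe, components):
    """Similarity of the whole expression to the average of its parts."""
    parts = [vectors[c] for c in components]
    mean = [sum(col) / len(parts) for col in zip(*parts)]
    return cosine(vectors[mwe], mean)

score = compositionality("harf_zadan", ["harf", "zadan"])
is_compositional = score >= 0.5  # assumed threshold; low score = idiomatic
print(round(score, 3))
```

A low score indicates that the expression's meaning diverges from its parts, i.e., it is likely non-compositional.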
We present version 1.3 of the PARSEME multilingual corpus annotated with verbal multiword expressions. Since the previous version, new languages have joined the undertaking of creating such a resource, some of the already existing corpora have been enriched with new annotated texts, and others have been enhanced in various ways. The PARSEME multilingual corpus now covers 26 languages. All monolingual corpora therein use the Universal Dependencies v.2 tagset. They have been (re-)split following the PARSEME v.1.2 standard, which puts emphasis on unseen VMWEs. With the current iteration, the corpus release process has been detached from shared tasks; instead, a process for continuous improvement and systematic releases has been introduced.
Supervised approaches usually achieve the best performance on the Word Sense Disambiguation (WSD) problem. However, the unavailability of large sense-annotated corpora for many low-resource languages makes these approaches inapplicable to them in practice. In this paper, we mitigate this issue for the Persian language by proposing a fully automatic approach for obtaining Persian SemCor (PerSemCor), a Persian Bag-of-Words (BoW) sense-annotated corpus. We evaluated PerSemCor both intrinsically and extrinsically and showed that it can be used effectively as a training set for Persian supervised WSD systems. To encourage future research on Persian Word Sense Disambiguation, we release PerSemCor at http://nlp.sbu.ac.ir.
In this paper, we present a WSD system that uses LDA topics for the semantic expansion of document words. Our system also uses sense frequency information from SemCor to give higher priority to the senses that are more likely to occur.
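The combination of topic-based evidence with a SemCor frequency prior can be sketched as a weighted score per candidate sense. The sense inventory, topic words, frequencies, and the 0.7 mixing weight below are all illustrative placeholders, not the system's actual data or parameters.

```python
# Toy sense inventory for the ambiguous word "bank".
senses = {
    "bank.n.01": {"topic_words": {"money", "loan", "deposit"}, "semcor_freq": 20},
    "bank.n.02": {"topic_words": {"river", "shore", "water"}, "semcor_freq": 5},
}

def score_sense(sense, context_topic_words, alpha=0.7):
    """Weighted mix of LDA-topic overlap and a SemCor sense-frequency prior."""
    info = senses[sense]
    overlap = len(info["topic_words"] & context_topic_words) / len(info["topic_words"])
    total = sum(s["semcor_freq"] for s in senses.values())
    prior = info["semcor_freq"] / total
    return alpha * overlap + (1 - alpha) * prior

# Words drawn from the document's dominant LDA topic.
context = {"loan", "interest", "money"}
best = max(senses, key=lambda s: score_sense(s, context))
print(best)  # bank.n.01
```

The prior acts as a tie-breaker: when topic evidence is weak, the more frequent SemCor sense wins.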
Researchers use wordnets as a knowledge base in many natural language processing tasks and applications, such as question answering, textual entailment, discourse classification, and so forth. Lexico-semantic relations among words or concepts are an important part of the knowledge encoded in wordnets. As the use of wordnets becomes extensively widespread, extending the existing ones gets more attention. Manual construction and extension of lexico-semantic relations for wordnets or knowledge graphs is very time-consuming; automatic relation extraction methods can speed up this process. In this study, we exploit an ensemble of LSTM and convolutional neural networks in a supervised manner to capture lexico-semantic relations, which can either be used directly in NLP applications or compose the edges of wordnets. The whole procedure of learning vector space representations of relations is language-independent. We used Princeton WordNet 3.1, FarsNet 3.0 (the Persian wordnet), Root09 and EVALution as gold standards to evaluate the predictive performance of our model, and the results are comparable across the two languages. Empirical results demonstrate that our model outperforms state-of-the-art models.
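One common way to combine an LSTM and a CNN classifier is soft voting: average their per-class probability distributions and take the argmax. The sketch below illustrates only that combination step, with hypothetical model outputs; the paper's actual ensembling strategy and class set may differ.

```python
# Hypothetical class-probability outputs of the two networks for three
# word pairs, over relation classes (0=hypernymy, 1=antonymy, 2=synonymy).
lstm_probs = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3], [0.3, 0.3, 0.4]]
cnn_probs  = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.2, 0.7]]

def soft_vote(p, q):
    """Average the two class distributions and return the argmax class."""
    avg = [(a + b) / 2 for a, b in zip(p, q)]
    return max(range(len(avg)), key=avg.__getitem__)

predictions = [soft_vote(p, q) for p, q in zip(lstm_probs, cnn_probs)]
print(predictions)  # [0, 1, 2]
```

Soft voting lets a confident model outvote an uncertain one, which is why it often beats hard majority voting for two-member ensembles.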
WordNet or ontology development for resource-poor languages like Persian requires combining several strategies and employing appropriate heuristics. Structured lexical and linguistic resources for Persian are limited, and the language exhibits considerable diversity as well as structural and syntagmatic complexity. This paper proposes a system for extracting verbal synsets and relations to extend FarsNet (the Persian WordNet). The proposed method extracts verbal words and concepts using noun and adjective words and synsets. It exploits data from digital lexicon glossaries, which leads to the identification of 6890 proper verbal words and 2790 verbal synsets, with 91% and 67% precision respectively. The proposed system also extracts relations such as semantic roles of verbal arguments (instrument, location, agent, and patient), "related-to" (unlabeled) relations, and co-occurrences between verbs and other concepts. For this purpose, a combination of linguistic approaches, such as morphological analysis of words, semantic analysis, and the use of key phrases and syntactic and semantic patterns, together with corpus-based approaches, statistical techniques, and co-occurrence analysis, has been utilized. The presented strategy extracts 5600 proper relations between the existing concepts in FarsNet 2.0 with 76% precision.
In this paper, we describe our proposed method for measuring the semantic similarity of a given pair of words in the SemEval-2017 monolingual semantic word similarity task. We use a combination of knowledge-based and corpus-based techniques: FarsNet, the Persian WordNet, alongside deep learning techniques, to extract the similarity of words. We evaluated our proposed approach on the Persian (Farsi) test data at SemEval-2017; it outperformed the other participants and ranked first in the challenge.
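A hybrid of knowledge-based and corpus-based scores is often realized as a weighted interpolation, backing off to the corpus score when a word pair is missing from the knowledge base. This is a minimal sketch of that idea; the 0.6 weight and the back-off rule are assumptions, not the method's tuned configuration.

```python
def combined_similarity(knowledge_sim, corpus_sim, w=0.6):
    """Interpolate a FarsNet-based score with an embedding-based score.

    knowledge_sim is None when the pair is not covered by FarsNet, in
    which case we back off to the corpus-based score alone.
    """
    if knowledge_sim is None:
        return corpus_sim
    return w * knowledge_sim + (1 - w) * corpus_sim

print(round(combined_similarity(0.9, 0.7), 2))  # 0.82
print(combined_similarity(None, 0.7))           # 0.7
```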
Sentiment shifters, i.e., words and expressions that can affect text polarity, play an important role in opinion mining. However, the limited ability of current automated opinion mining systems to handle shifters represents a major challenge. The majority of existing approaches rely on a manual list of shifters; few attempts have been made to automatically identify shifters in text, and most of those focus only on negation shifters. This paper presents a novel and efficient semi-automatic method for identifying sentiment shifters in drug reviews, aiming at improving the overall accuracy of opinion mining systems. To this end, we use weighted association rule mining (WARM), a well-known data mining technique, to find frequent dependency patterns representing sentiment shifters in a domain-specific corpus. These patterns, which include different kinds of shifter words such as shifter verbs and quantifiers, are able to handle both local and long-distance shifters. We also combine these patterns with a lexicon-based approach for the polarity classification task. Experiments on drug reviews demonstrate that the extracted shifters can improve the precision of the lexicon-based approach for polarity classification by 9.25 percent.
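The core of weighted association rule mining is weighted support: the weight mass of the transactions containing a candidate pattern, divided by the total weight. The toy review snippets and weights below are illustrative (weights could come from, e.g., review helpfulness votes); the paper's transaction encoding is richer (dependency patterns).

```python
# Each transaction: (set of context tokens around a polarity word, weight).
transactions = [
    ({"no", "longer", "pain"}, 3.0),
    ({"no", "more", "pain"}, 2.0),
    ({"severe", "pain"}, 1.0),
    ({"no", "side", "effects"}, 2.0),
]

def weighted_support(pattern, transactions):
    """Weight of transactions containing the pattern over total weight."""
    hit = sum(w for items, w in transactions if pattern <= items)
    total = sum(w for _, w in transactions)
    return hit / total

# Candidate shifter pattern: the negator "no" preceding a polarity word.
support = weighted_support({"no"}, transactions)
print(round(support, 3))  # 0.875
```

Patterns whose weighted support clears a minimum threshold are kept as shifter candidates for the polarity classifier.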
This paper discusses the semantic augmentation of FarsNet, the Persian WordNet, with new relations and structures for verbs. FarsNet 1.0, the first Persian WordNet, follows the structure of Princeton WordNet 2.1. In this paper we discuss FarsNet 2.0, in which new inter-POS relations and verb frames are added; in fact, FarsNet 2.0 is a combination of a WordNet and a VerbNet for Persian. It includes more than 30,000 lexical entries arranged in about 20,000 synsets, with about 18,000 mappings to Princeton WordNet synsets. There are about 43,000 relations between synsets and senses in FarsNet 2.0. It also includes verb frames at two levels (syntactic and thematic) for about 200 simple Persian verbs.
Nowadays, wordnets are used in natural language processing as major linguistic resources. Having such a resource for the Persian language helps researchers in computational linguistics and natural language processing develop more accurate systems with higher performance. In this research, we propose a model for the semi-automatic construction of a Persian wordnet of verbs. Compound verbs are a very productive structure in Persian, and the number of compound verbs is much greater than that of simple verbs in this language. This research aims at finding the structure of Persian compound verbs and the relations between verb components. The main idea behind the system is to use the wordnets of other POS categories (namely nouns and adjectives) to extract Persian compound verbs, their synsets and their relations. This paper focuses on three main tasks: 1. extracting compound verbs, 2. extracting verbal synsets, and 3. extracting the relations among verbal synsets, such as hypernymy, antonymy and cause.
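Since Persian compound verbs typically pair a nominal preverb with a small closed class of light verbs (e.g., kardan "to do", zadan "to hit"), candidate extraction can be sketched as generate-and-filter: pair nouns with light verbs, then keep combinations attested in a corpus. The transliterated word lists and the toy corpus below are illustrative assumptions.

```python
# Closed class of common Persian light verbs (transliterated).
light_verbs = ["kardan", "shodan", "zadan", "dadan"]
# Nouns taken (hypothetically) from the noun wordnet.
nouns = ["harf", "kar", "komak"]
# Toy stand-in for corpus attestation of verb candidates.
corpus = {"harf zadan", "kar kardan", "komak kardan"}

# Generate all noun + light-verb candidates, keep the attested ones.
candidates = {f"{n} {lv}" for n in nouns for lv in light_verbs}
compound_verbs = sorted(candidates & corpus)
print(compound_verbs)  # ['harf zadan', 'kar kardan', 'komak kardan']
```

In the full system, the surviving candidates inherit synset and relation information from the noun or adjective component's wordnet entry.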
Semantic lexicons and lexical ontologies are major resources in natural language processing. Developing such resources is a time-consuming task for which some automatic methods have been proposed. This paper describes methods used in the semi-automatic development of FarsNet, a lexical ontology for the Persian language. FarsNet includes the Persian WordNet with more than 10,000 synsets of nouns, verbs and adjectives. In this paper we discuss the extraction of lexico-conceptual relations, such as synonymy, antonymy, hypernymy, hyponymy, meronymy, holonymy, and other lexical or conceptual relations between words and concepts (synsets), from Persian resources. Relations are extracted from different resources such as the web, corpora, Wikipedia, Wiktionary, dictionaries and WordNet. In the presented system, a variety of approaches are applied to the task of relation extraction to extract labeled or unlabeled relations. They exploit the texts, structures, hyperlinks and statistics of web documents, as well as the relations of the English WordNet and the entries of mono- and bilingual dictionaries.
Many NLP applications need fundamental tools to convert the input text into an appropriate form or format and extract the primary linguistic knowledge of words and sentences. These tools perform segmentation of text into sentences, words and phrases, check and correct spellings, do lexical and morphological analysis, POS tagging and so on. Persian is among the languages with complex preprocessing tasks. Differing writing prescriptions, spacing between or within words, character encodings and spellings are some of the difficulties and challenges in converting various texts into a standard one. The lack of fundamental text processing tools such as a morphological analyser (especially for derivational morphology) and a POS tagger is another problem in Persian text processing. This paper introduces a set of fundamental tools for Persian text processing in the STeP-1 package. STeP-1 (Standard Text Preparation for Persian language) performs a combination of tokenization, spell checking, morphological analysis and POS tagging. It also converts all Persian texts with different prescribed forms of writing into a series of tokens in the standard style introduced by the Academy of Persian Language and Literature (APLL). Experimental results show high performance.
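The character-encoding and spacing normalization step can be sketched with two well-known Persian cases: mapping Arabic code points to their Persian equivalents and replacing the plain space before the plural suffix "ها" with a zero-width non-joiner. This rule set is a tiny illustrative subset, not STeP-1's actual tables.

```python
# Map Arabic code points to their Persian equivalents.
CHAR_MAP = {
    "\u064a": "\u06cc",  # Arabic Yeh  -> Persian Yeh
    "\u0643": "\u06a9",  # Arabic Kaf  -> Persian Keheh
}

def normalize(text):
    """Standardize characters, then fix spacing before the plural suffix."""
    text = "".join(CHAR_MAP.get(ch, ch) for ch in text)
    # Replace a plain space before "ها" with a zero-width non-joiner (ZWNJ).
    return text.replace(" \u0647\u0627", "\u200c\u0647\u0627")

print(normalize("كتاب ها"))  # "کتاب‌ها": Persian Kaf, ZWNJ-joined plural
```

Normalizing to a single code point per letter is what makes downstream tokenization and lexicon lookup reliable across differently encoded source texts.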
Lexical ontologies and semantic lexicons are important resources in natural language processing. They are used in various tasks and applications, especially where semantic processing is involved, such as question answering, machine translation, text understanding, information retrieval and extraction, content management, text summarization, knowledge acquisition and semantic search engines. Although there are a number of semantic lexicons for English and some other languages, Persian lacks such a complete resource for use in NLP work. In this paper we introduce an ongoing project on developing a lexical ontology for Persian called FarsNet. We exploit a hybrid semi-automatic approach to acquire lexical and conceptual knowledge from resources such as WordNet, bilingual dictionaries, monolingual corpora, and morpho-syntactic and semantic templates. FarsNet is an ontology whose elements are lexicalized in Persian. It provides links between various types of words (cross-POS relations) and also between words and their corresponding concepts in other ontologies (cross-ontology relations). FarsNet aggregates the power of WordNet on nouns, the power of FrameNet on verbs, and the wide range of conceptual relations from the ontology community.
In many applications of natural language processing (NLP), grammatically tagged corpora are needed; thus, part-of-speech (POS) tagging is of high importance in NLP. Many taggers have been designed with different approaches to reach high performance and accuracy. These taggers usually deal with inter-word relations and make use of lexicons. In this paper we present a new tagging algorithm with a hybrid approach. The algorithm combines the features of probabilistic and rule-based taggers to tag Persian unknown words. In contrast to many other tagging algorithms, it deals with the internal structure of words and does not need any built-in knowledge. The introduced tagging algorithm is domain-independent because it uses morphological rules. In this algorithm, POS tags are assigned to an unknown word with a probability that reflects the confidence of the assigned tag. Although this tagger is proposed for Persian, it can be adapted to other languages by applying their morphological rules.
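Tagging an unknown word from its internal structure can be sketched as matching the longest morphological suffix rule and emitting the rule's tag together with its confidence. The transliterated suffixes and probabilities below are illustrative placeholders, not the paper's actual rule set or learned values.

```python
# (suffix, tag, probability) rules; longest matching suffix wins.
SUFFIX_RULES = [
    ("gari", "NOUN", 0.90),  # e.g. the "-gari" nominalizer
    ("ane",  "ADV",  0.75),  # e.g. the "-ane" adverbializer
    ("i",    "ADJ",  0.60),  # e.g. the relational "-i"
]

def tag_unknown(word, default=("NOUN", 0.50)):
    """Return (tag, probability) from the longest matching suffix rule."""
    for suffix, tag, prob in sorted(SUFFIX_RULES, key=lambda r: -len(r[0])):
        if word.endswith(suffix):
            return tag, prob
    return default

print(tag_unknown("kuzegari"))  # ('NOUN', 0.9)
```

Because the rules reference only word-internal morphology, the same mechanism transfers to another language by swapping in that language's suffix table.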