Pintu Lohar

2023

pdf bib abs
Do Not Discard – Extracting Useful Fragments from Low-Quality Parallel Data to Improve Machine Translation
Steinþór Steingrímsson | Pintu Lohar | Hrafn Loftsson | Andy Way
Proceedings of the Second Workshop on Corpus Generation and Corpus Augmentation for Machine Translation

When parallel corpora are preprocessed for machine translation (MT) training, a part of the parallel data is commonly discarded and deemed non-parallel due to odd-length ratio, overlapping text in source and target sentences or failing some other form of a semantic equivalency test. For language pairs with limited parallel resources, this can be costly as in such cases modest amounts of acceptable data may be useful to help build MT systems that generate higher quality translations. In this paper, we refine parallel corpora for two language pairs, English–Bengali and English–Icelandic, by extracting sub-sentence fragments from sentence pairs that would otherwise have been discarded, in order to increase recall when compiling training data. We find that by including the fragments, translation quality of NMT systems trained on the data improves significantly when translating from English to Bengali and from English to Icelandic.

2022

pdf bib abs
Building Machine Translation System for Software Product Descriptions Using Domain-specific Sub-corpora Extraction
Pintu Lohar | Sinead Madden | Edmond O’Connor | Maja Popovic | Tanya Habruseva
Proceedings of the 15th biennial conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)

Building Machine Translation systems for a specific domain requires a sufficiently large and good quality parallel corpus in that domain. However, this is a bit challenging task due to the lack of parallel data in many domains such as economics, science and technology, sports etc. In this work, we build English-to-French translation systems for software product descriptions scraped from LinkedIn website. Moreover, we developed a first-ever test parallel data set of product descriptions. We conduct experiments by building a baseline translation system trained on general domain and then domain-adapted systems using sentence-embedding based corpus filtering and domain-specific sub-corpora extraction. All the systems are tested on our newly developed data set mentioned earlier. Our experimental evaluation reveals that the domain-adapted model based on our proposed approaches outperforms the baseline.

pdf bib abs
Developing Machine Translation Engines for Multilingual Participatory Spaces
Pintu Lohar | Guodong Xie | Andy Way
Proceedings of the 23rd Annual Conference of the European Association for Machine Translation

It is often a challenging task to build Machine Translation (MT) engines for a specific domain due to the lack of parallel data in that area. In this project, we develop a range of MT systems for 6 European languages (English, German, Italian, French, Polish and Irish) in all directions and in two domains (environment and economics).

pdf bib abs
TransCasm: A Bilingual Corpus of Sarcastic Tweets
Desline Simon | Sheila Castilho | Pintu Lohar | Haithem Afli
Proceedings of the LREC 2022 workshop on Natural Language Processing for Political Sciences

Sarcasm is extensively used in User Generated Content (UGC) in order to express one’s discontent, especially through blogs, forums, or social media such as Twitter. Several works have attempted to detect and analyse sarcasm in UGC. However, the lack of freely available corpora in this field makes the task even more difficult. In this work, we present “TransCasm” corpus, a parallel corpus of sarcastic tweets translated from English into French along with their non-sarcastic representations. To build the bilingual corpus of sarcasm, we select the “SIGN” corpus, a monolingual data set of sarcastic tweets and their non-sarcastic interpretations, created by (Peled and Reichart, 2017). We propose to define linguistic guidelines for developing “TransCasm” which is the first ever bilingual corpus of sarcastic tweets. In addition, we utilise “TransCasm” for building a binary sarcasm classifier in order to identify whether a tweet is sarcastic or not. Our experiment reveals that the sarcasm classifier achieves 61% accuracy on detecting sarcasm in tweets. “TransCasm” is now freely available online and is ready to be explored for further research.

2021

pdf bib abs
Effective Bitext Extraction From Comparable Corpora Using a Combination of Three Different Approaches
Steinþór Steingrímsson | Pintu Lohar | Hrafn Loftsson | Andy Way
Proceedings of the 14th Workshop on Building and Using Comparable Corpora (BUCC 2021)

Parallel sentences extracted from comparable corpora can be useful to supplement parallel corpora when training machine translation (MT) systems. This is even more prominent in low-resource scenarios, where parallel corpora are scarce. In this paper, we present a system which uses three very different measures to identify and score parallel sentences from comparable corpora. We measure the accuracy of our methods in low-resource settings by comparing the results against manually curated test data for English–Icelandic, and by evaluating an MT system trained on the concatenation of the parallel data extracted by our approach and an existing data set. We show that the system is capable of extracting useful parallel sentences with high accuracy, and that the extracted pairs substantially increase translation quality of an MT system trained on the data, as measured by automatic evaluation metrics.

2020

pdf bib
The Impact of Indirect Machine Translation on Sentiment Classification
Alberto Poncelas | Pintu Lohar | James Hadley | Andy Way
Proceedings of the 14th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)

2019

pdf bib abs
Building English-to-Serbian Machine Translation System for IMDb Movie Reviews
Pintu Lohar | Maja Popović | Andy Way
Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing

This paper reports the results of the first experiment dealing with the challenges of building a machine translation system for user-generated content involving a complex South Slavic language. We focus on translation of English IMDb user movie reviews into Serbian, in a low-resource scenario. We explore potentials and limits of (i) phrase-based and neural machine translation systems trained on out-of-domain clean parallel data from news articles (ii) creating additional synthetic in-domain parallel corpus by machine-translating the English IMDb corpus into Serbian. Our main findings are that morphology and syntax are better handled by the neural approach than by the phrase-based approach even in this low-resource mismatched domain scenario, however the situation is different for the lexical aspect, especially for person names. This finding also indicates that in general, machine translation of person names into Slavic languages (especially those which require/allow transcription) should be investigated more systematically.

2018

pdf bib
FooTweets: A Bilingual Parallel Corpus of World Cup Tweets
Henny Sluyter-Gäthje | Pintu Lohar | Haithem Afli | Andy Way
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
Balancing Translation Quality and Sentiment Preservation (Non-archival Extended Abstract)
Pintu Lohar | Haithem Afli | Andy Way
Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)

2017

pdf bib abs
Using Images to Improve Machine-Translating E-Commerce Product Listings.
Iacer Calixto | Daniel Stein | Evgeny Matusov | Pintu Lohar | Sheila Castilho | Andy Way
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

In this paper we study the impact of using images to machine-translate user-generated e-commerce product listings. We study how a multi-modal Neural Machine Translation (NMT) model compares to two text-only approaches: a conventional state-of-the-art attentional NMT and a Statistical Machine Translation (SMT) model. User-generated product listings often do not constitute grammatical or well-formed sentences. More often than not, they consist of the juxtaposition of short phrases or keywords. We train our models end-to-end as well as use text-only and multi-modal NMT models for re-ranking n-best lists generated by an SMT model. We qualitatively evaluate our user-generated training data also analyse how adding synthetic data impacts the results. We evaluate our models quantitatively using BLEU and TER and find that (i) additional synthetic data has a general positive impact on text-only and multi-modal NMT models, and that (ii) using a multi-modal NMT model for re-ranking n-best lists improves TER significantly across different n-best list sizes.

pdf bib abs
ADAPT at IJCNLP-2017 Task 4: A Multinomial Naive Bayes Classification Approach for Customer Feedback Analysis task
Pintu Lohar | Koel Dutta Chowdhury | Haithem Afli | Mohammed Hasanuzzaman | Andy Way
Proceedings of the IJCNLP 2017, Shared Tasks

In this age of the digital economy, promoting organisations attempt their best to engage the customers in the feedback provisioning process. With the assistance of customer insights, an organisation can develop a better product and provide a better service to its customer. In this paper, we analyse the real world samples of customer feedback from Microsoft Office customers in four languages, i.e., English, French, Spanish and Japanese and conclude a five-plus-one-classes categorisation (comment, request, bug, complaint, meaningless and undetermined) for meaning classification. The task is to %access multilingual corpora annotated by the proposed meaning categorization scheme and develop a system to determine what class(es) the customer feedback sentences should be annotated as in four languages. We propose following approaches to accomplish this task: (i) a multinomial naive bayes (MNB) approach for multi-label classification, (ii) MNB with one-vs-rest classifier approach, and (iii) the combination of the multilabel classification-based and the sentiment classification-based approach. Our best system produces F-scores of 0.67, 0.83, 0.72 and 0.7 for English, Spanish, French and Japanese, respectively. The results are competitive to the best ones for all languages and secure 3rd and 5th position for Japanese and French, respectively, among all submitted systems.

pdf bib abs
MultiNews: A Web collection of an Aligned Multimodal and Multilingual Corpus
Haithem Afli | Pintu Lohar | Andy Way
Proceedings of the First Workshop on Curation and Applications of Parallel and Comparable Corpora

Integrating Natural Language Processing (NLP) and computer vision is a promising effort. However, the applicability of these methods directly depends on the availability of a specific multimodal data that includes images and texts. In this paper, we present a collection of a Multimodal corpus of comparable texts and their images in 9 languages from the web news articles of Euronews website. This corpus has found widespread use in the NLP community in Multilingual and multimodal tasks. Here, we focus on its acquisition of the images and text data and their multilingual alignment.