The 13th International Joint Conference on Natural Language Processing and The 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics - ACL Anthology

Nusa Dua, Bali
November 2023

Volumes

Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers) IJCNLP AACL 73 papers
Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 2: Short Papers) IJCNLP AACL 23 papers
Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics: Student Research Workshop IJCNLP AACL 13 papers
Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics: Tutorial Abstract IJCNLP AACL 7 papers
Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics: System Demonstrations IJCNLP 10 papers
Findings of the Association for Computational Linguistics: IJCNLP-AACL 2023 (Findings) Findings 38 papers
Proceedings of the ART of Safety: Workshop on Adversarial testing and Red-Teaming for generative AI artofsafety 7 papers
Proceedings of the 4th Workshop on Evaluation and Comparison of NLP Systems Eval4NLP 20 papers
Proceedings of the Sixth Workshop on Financial Technology and Natural Language Processing FinNLP 14 papers
Proceedings of the Second Workshop on Natural Language Interfaces nlint 4 papers
Proceedings of the Third Workshop on NLP for Medical Conversations NLPMC 3 papers
Proceedings of the First Workshop in South East Asian Language Processing sealp 9 papers
Proceedings of the 11th International Workshop on Natural Language Processing for Social Media SocialNLP 7 papers
Proceedings of the Second Workshop on Information Extraction from Scientific Publications WIESP WASP 18 papers

Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Jong C. Park | Yuki Arase | Baotian Hu | Wei Lu | Derry Wijaya | Ayu Purwarianti | Adila Alfa Krisnadhi

Toward Unified Controllable Text Generation via Regular Expression Instruction
Xin Zheng | Hongyu Lin | Xianpei Han | Le Sun

Don’t be Blind to Questions: Question-Oriented Math Word Problem Solving
Zhenwen Liang | Jipeng Zhang | Xiangliang Zhang

SILVER: Self Data Augmentation for Out-of-Scope Detection in Dialogues
Chunpeng Ma | Takuya Makino

MQAG: Multiple-choice Question Answering and Generation for Assessing Information Consistency in Summarization
Potsawee Manakul | Adian Liusie | Mark Gales

MCML: A Novel Memory-based Contrastive Meta-Learning Method for Few Shot Slot Tagging
Hongru Wang | Zezhong Wang | Wai Chung Kwan | Kam-Fai Wong

RECESS: Resource for Extracting Cause, Effect, and Signal Spans
Fiona Anting Tan | Hansi Hettiarachchi | Ali Hürriyetoğlu | Nelleke Oostdijk | Tommaso Caselli | Tadashi Nomoto | Onur Uca | Farhana Ferdousi Liza | See-Kiong Ng

SYNC: A Structurally Guided Hard Negative Curricula for Generalizable Neural Code Search
Atharva Naik | Soumitra Das | Jyothi Vedurada | Somak Aditya

On a Benefit of Masked Language Model Pretraining: Robustness to Simplicity Bias
Ting-Rui Chiang

Conversation Style Transfer using Few-Shot Learning
Shamik Roy | Raphael Shu | Nikolaos Pappas | Elman Mansimov | Yi Zhang | Saab Mansour | Dan Roth

MasakhaNEWS: News Topic Classification for African languages
David Ifeoluwa Adelani | Marek Masiak | Israel Abebe Azime | Jesujoba Alabi | Atnafu Lambebo Tonja | Christine Mwase | Odunayo Ogundepo | Bonaventure F. P. Dossou | Akintunde Oladipo | Doreen Nixdorf | Chris Chinenye Emezue | Sana Al-azzawi | Blessing Sibanda | Davis David | Lolwethu Ndolela | Jonathan Mukiibi | Tunde Ajayi | Tatiana Moteu | Brian Odhiambo | Abraham Owodunni | Nnaemeka Obiefuna | Muhidin Mohamed | Shamsuddeen Hassan Muhammad | Teshome Mulugeta Ababu | Saheed Abdullahi Salahudeen | Mesay Gemeda Yigezu | Tajuddeen Gwadabe | Idris Abdulmumin | Mahlet Taye | Oluwabusayo Awoyomi | Iyanuoluwa Shode | Tolulope Adelani | Habiba Abdulganiyu | Abdul-Hakeem Omotayo | Adetola Adeeko | Abeeb Afolabi | Anuoluwapo Aremu | Olanrewaju Samuel | Clemencia Siro | Wangari Kimotho | Onyekachi Ogbu | Chinedu Mbonu | Chiamaka Chukwuneke | Samuel Fanijo | Jessica Ojo | Oyinkansola Awosan | Tadesse Kebede | Toadoum Sari Sakayo | Pamela Nyatsine | Freedmore Sidume | Oreen Yousuf | Mardiyyah Oduwole | Kanda Tshinu | Ussen Kimanuka | Thina Diko | Siyanda Nxakama | Sinodos Nigusse | Abdulmejid Johar | Shafie Mohamed | Fuad Mire Hassan | Moges Ahmed Mehamed | Evrard Ngabire | Jules Jules | Ivan Ssenkungu | Pontus Stenetorp

Automatic Translation of Span-Prediction Datasets
Ofri Masad | Kfir Bar | Amir David Nissan Cohen

Human-Like Distractor Response in Vision-Language Model
Xiaonan Xu | Haoshuo Chen

Phylogeny-Inspired Soft Prompts For Data-to-Text Generation in Low-Resource Languages
William Soto Martinez | Yannick Parmentier | Claire Gardent

Analysing Cross-Lingual Transfer in Low-Resourced African Named Entity Recognition
Michael Beukman | Manuel Fokam

A Multimodal Analysis of Influencer Content on Twitter
Danae Sánchez Villegas | Catalina Goanta | Nikolaos Aletras

Reimagining Complaint Analysis: Adopting Seq2Path for a Generative Text-to-Text Framework
Apoorva Singh | Raghav Jain | Sriparna Saha

FollowupQG: Towards information-seeking follow-up question generation
Yan Meng | Liangming Pan | Yixin Cao | Min-Yen Kan

Zero-shot Triplet Extraction by Template Infilling
Bosung Kim | Hayate Iso | Nikita Bhutani | Estevam Hruschka | Ndapa Nakashole | Tom Mitchell

Generating and Answering Simple and Complex Questions from Text and from Knowledge Graphs
Kelvin Han | Claire Gardent

Faithful Chain-of-Thought Reasoning
Qing Lyu | Shreya Havaldar | Adam Stein | Li Zhang | Delip Rao | Eric Wong | Marianna Apidianaki | Chris Callison-Burch

Linguistic Productivity: the Case of Determiners in English
Raquel G. Alhama | Ruthe Foushee | Daniel Byrne | Allyson Ettinger | Susan Goldin-Meadow | Afra Alishahi

Informative Evidence-guided Prompt-based Fine-tuning for English-Korean Critical Error Detection
DaHyun Jung | Sugyeong Eo | Chanjun Park | Hyeonseok Moon | Jaehyung Seo | Heuiseok Lim

Assessment of Pre-Trained Models Across Languages and Grammars
Alberto Muñoz-Ortiz | David Vilares | Carlos Gómez-Rodríguez

Rethinking the Role of Entity Type in Relation Classification
Xiang Dai | Sarvnaz Karimi | Stephen Wan

Improving Neural Machine Translation with Offline Evaluations
Min-Kyung Park | Byung-Jun Lee

Query Rewriting for Effective Misinformation Discovery
Ashkan Kazemi | Artem Abzaliev | Naihao Deng | Rui Hou | Scott A. Hale | Veronica Perez-Rosas | Rada Mihalcea

24-bit Languages
Yiran Wang | Taro Watanabe | Masao Utiyama | Yuji Matsumoto

DisCGen: A Framework for Discourse-Informed Counterspeech Generation
Sabit Hassan | Malihe Alikhani

Question Answer Generation in Bengali: Mitigating the scarcity of QA datasets in a low-resource language
Md Shihab Shahriar | Ahmad Al Fayad Chowdhury | Md. Amimul Ehsan | Abu Raihan Kamal

One Sense per Translation
Bradley Hauer | Grzegorz Kondrak

Interactive-Chain-Prompting: Ambiguity Resolution for Crosslingual Conditional Generation with Interaction
Jonathan Pilault | Xavier Garcia | Arthur Bražinskas | Orhan Firat

J-Guard: Journalism Guided Adversarially Robust Detection of AI-generated News
Tharindu Kumarage | Amrita Bhattacharjee | Djordje Padejski | Kristy Roschke | Dan Gillmor | Scott Ruston | Huan Liu | Joshua Garland

We Need to Talk About Classification Evaluation Metrics in NLP
Peter Vickers | Loic Barrault | Emilio Monti | Nikolaos Aletras

Investigating Zero- and Few-shot Generalization in Fact Verification
Liangming Pan | Yunxiang Zhang | Min-Yen Kan

Attacking Open-domain Question Answering by Injecting Misinformation
Liangming Pan | Wenhu Chen | Min-Yen Kan | William Yang Wang

Emerging Challenges in Personalized Medicine: Assessing Demographic Effects on Biomedical Question Answering Systems
Sagi Shaier | Kevin Bennett | Lawrence Hunter | Katharina Kann

Smoothing Entailment Graphs with Language Models
Nick McKenna | Tianyi Li | Mark Johnson | Mark Steedman

FastRAT: Fast and Efficient Cross-lingual Text-to-SQL Semantic Parsing
Pavlos Vougiouklis | Nikos Papasarantopoulos | Danna Zheng | David Tuckey | Chenxin Diao | Zhili Shen | Jeff Pan

ProMap: Effective Bilingual Lexicon Induction via Language Model Prompting
Abdellah El Mekki | Muhammad Abdul-Mageed | ElMoatez Billah Nagoudi | Ismail Berrada | Ahmed Khoumsi

ConDA: Contrastive Domain Adaptation for AI-generated Text Detection
Amrita Bhattacharjee | Tharindu Kumarage | Raha Moraffah | Huan Liu

A Review of Datasets for Aspect-based Sentiment Analysis
Siva Uday Sampreeth Chebolu | Franck Dernoncourt | Nedim Lipka | Thamar Solorio

MedRedQA for Medical Consumer Question Answering: Dataset, Tasks, and Neural Baselines
Vincent Nguyen | Sarvnaz Karimi | Maciej Rybinski | Zhenchang Xing

Valla: Standardizing and Benchmarking Authorship Attribution and Verification Through Empirical Evaluation and Comparative Analysis
Jacob Tyo | Bhuwan Dhingra | Zachary C. Lipton

Sentiment Aided Graph Attentive Contextualization for Task Oriented Negotiation Dialogue Generation
Aritra Raut | Sriparna Saha | Anutosh Maitra | Roshni Ramnani

A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity
Yejin Bang | Samuel Cahyawijaya | Nayeon Lee | Wenliang Dai | Dan Su | Bryan Wilie | Holy Lovenia | Ziwei Ji | Tiezheng Yu | Willy Chung | Quyet V. Do | Yan Xu | Pascale Fung

Analyzing and Predicting Persistence of News Tweets
Maggie Liu | Jing Wang | Daniel Preotiuc-Pietro

Only 5% Attention Is All You Need: Efficient Long-range Document-level Neural Machine Translation
Zihan Liu | Zewei Sun | Shanbo Cheng | Shujian Huang | Mingxuan Wang

Uncertainty Estimation for Debiased Models: Does Fairness Hurt Reliability?
Gleb Kuzmin | Artem Vazhentsev | Artem Shelmanov | Xudong Han | Simon Suster | Maxim Panov | Alexander Panchenko | Timothy Baldwin

Semi-supervised News Discourse Profiling with Contrastive Learning
Ming Li | Ruihong Huang

Target-Aware Contextual Political Bias Detection in News
Iffat Maab | Edison Marrese-Taylor | Yutaka Matsuo

Controllable Discovery of Intents: Incremental Deep Clustering Using Semi-Supervised Contrastive Learning
Mrinal Rawat | Hithesh Sankararaman | Victor Barres

Benchmarking Procedural Language Understanding for Low-Resource Languages: A Case Study on Turkish
Arda Uzunoglu | Gözde Şahin

LexicoMatic: Automatic Creation of Multilingual Lexical-Semantic Dictionaries
Federico Martelli | Luigi Procopio | Edoardo Barba | Roberto Navigli

FiRo: Finite-context Indexing of Restricted Output Space for NLP Models Facing Noisy Input
Minh Nguyen | Nancy Chen

Implicit Affordance Acquisition via Causal Action–Effect Modeling in the Video Domain
Hsiu-Yu Yang | Carina Silberer

Prover: Generating Intermediate Steps for NLI with Commonsense Knowledge Retrieval and Next-Step Prediction
Deepanway Ghosal | Somak Aditya | Monojit Choudhury

Exploring the Impact of Training Data Distribution and Subword Tokenization on Gender Bias in Machine Translation
Bar Iluz | Tomasz Limisiewicz | Gabriel Stanovsky | David Mareček

GrailQA++: A Challenging Zero-Shot Benchmark for Knowledge Base Question Answering
Ritam Dutt | Sopan Khosla | Vinayshekhar Bannihatti Kumar | Rashmi Gangadharaiah

Model-based Subsampling for Knowledge Graph Completion
Xincan Feng | Hidetaka Kamigaito | Katsuhiko Hayashi | Taro Watanabe

The Persuasive Memescape: Understanding Effectiveness and Societal Implications of Internet Memes
Gitanjali Kumari | Pranali Shinde | Asif Ekbal

Generation of Korean Offensive Language by Leveraging Large Language Models via Prompt Design
Jisu Shin | Hoyun Song | Huije Lee | Fitsum Gaim | Jong Park

PICK: Polished & Informed Candidate Scoring for Knowledge-Grounded Dialogue Systems
Bryan Wilie | Yan Xu | Willy Chung | Samuel Cahyawijaya | Holy Lovenia | Pascale Fung

Towards LLM-based Fact Verification on News Claims with a Hierarchical Step-by-Step Prompting Method
Xuan Zhang | Wei Gao

Retrieval Augmented Generation with Rich Answer Encoding
Wenyu Huang | Mirella Lapata | Pavlos Vougiouklis | Nikos Papasarantopoulos | Jeff Pan

Examining Consistency of Visual Commonsense Reasoning based on Person Grounding
Huiju Kim | Youjin Kang | SangKeun Lee

Self-Consistent Narrative Prompts on Abductive Natural Language Inference
Chunkit Chan | Xin Liu | Tsz Ho Chan | Jiayang Cheng | Yangqiu Song | Ginny Wong | Simon See

KoBigBird-large: Transformation of Transformer for Korean Language Understanding
Kisu Yang

Reranking for Natural Language Generation from Logical Forms: A Study based on Large Language Models
Levon Haroutunian | Zhuang Li | Lucian Galescu | Philip Cohen | Raj Tumuluri | Gholamreza Haffari

Exploring Methods for Cross-lingual Text Style Transfer: The Case of Text Detoxification
Daryna Dementieva | Daniil Moskovskiy | David Dale | Alexander Panchenko

PACT: Pretraining with Adversarial Contrastive Learning for Text Classification
Md Tawkat Islam Khondaker | Muhammad Abdul-Mageed | Laks V.S. Lakshmanan

VACASPATI: A Diverse Corpus of Bangla Literature
Pramit Bhattacharyya | Joydeep Mondal | Subhadip Maji | Arnab Bhattacharya

Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 2: Short Papers)
Jong C. Park | Yuki Arase | Baotian Hu | Wei Lu | Derry Wijaya | Ayu Purwarianti | Adila Alfa Krisnadhi

Self-Augmentation Improves Zero-Shot Cross-Lingual Transfer
Fei Wang | Kuan-Hao Huang | Kai-Wei Chang | Muhao Chen

Learning to Predict Concept Ordering for Common Sense Generation
Tianhui Zhang | Danushka Bollegala | Bei Peng

SQUARE: Automatic Question Answering Evaluation using Multiple Positive and Negative References
Matteo Gabburo | Siddhant Garg | Rik Koncel-Kedziorski | Alessandro Moschitti

The Impact of Debiasing on the Performance of Language Models in Downstream Tasks is Underestimated
Masahiro Kaneko | Danushka Bollegala | Naoaki Okazaki

Enhancing Volatility Forecasting in Financial Markets: A General Numeral Attachment Dataset for Understanding Earnings Calls
Ming-Xuan Shi | Chung-Chi Chen | Hen-Hsen Huang | Hsin-Hsi Chen

Do the Benefits of Joint Models for Relation Extraction Extend to Document-level Tasks?
Pratik Saini | Tapas Nayak | Indrajit Bhattacharya

On the Challenges of Fully Incremental Neural Dependency Parsing
Ana Ezquerro | Carlos Gómez-Rodríguez | David Vilares

Learning a Better Initialization for Soft Prompts via Meta-Learning
Yukun Huang | Kun Qian | Zhou Yu

Issues Surrounding the Use of ChatGPT in Similar Languages: The Case of Malay and Indonesian
Hiroki Nomoto

Can You Translate for Me? Code-Switched Machine Translation with Large Language Models
Jyotsana Khatri | Vivek Srivastava | Lovekesh Vig

Efficient Zero-Shot Cross-lingual Inference via Retrieval
Genta Winata | Lingjue Xie | Karthik Radhakrishnan | Yifan Gao | Daniel Preotiuc-Pietro

Minimum Bayes’ Risk Decoding for System Combination of Grammatical Error Correction Systems
Vyas Raina | Mark Gales

Who Are All The Stochastic Parrots Imitating? They Should Tell Us!
Sagi Shaier | Lawrence Hunter | Katharina Kann

Incorporating Singletons and Mention-based Features in Coreference Resolution via Multi-task Learning for Better Generalization
Yilun Zhu | Siyao Peng | Sameer Pradhan | Amir Zeldes

All Labels Together: Low-shot Intent Detection with an Efficient Label Semantic Encoding Paradigm
Jiangshu Du | Congying Xia | Wenpeng Yin | Tingting Liang | Philip Yu

Theia: Weakly Supervised Multimodal Event Extraction from Incomplete Data
Farhad Moghimifar | Fatemeh Shiri | Van Nguyen | Yuan-Fang Li | Gholamreza Haffari

Perplexity-Driven Case Encoding Needs Augmentation for CAPITALIZATION Robustness
Rohit Jain | Huda Khayrallah | Roman Grundkiewicz | Marcin Junczys-Dowmunt

Enhancing Open-Domain Table Question Answering via Syntax- and Structure-aware Dense Retrieval
Nengzheng Jin | Dongfang Li | Junying Chen | Joanna Siebert | Qingcai Chen

The Language Model, Resources, and Computational Pipelines for the Under-Resourced Iranian Azerbaijani
Marzia Nouri | Mahsa Amani | Reihaneh Zohrabi | Ehsaneddin Asgari

Borderless Azerbaijani Processing: Linguistic Resources and a Transformer-based Approach for Azerbaijani Transliteration
Reihaneh Zohrabi | Mostafa Masumi | Omid Ghahroodi | Parham AbedAzad | Hamid Beigy | Mohammad Hossein Rohban | Ehsaneddin Asgari

Are Machine Reading Comprehension Systems Robust to Context Paraphrasing?
Yulong Wu | Viktor Schlegel | Riza Batista-Navarro

It’s not only What You Say, It’s also Who It’s Said to: Counterfactual Analysis of Interactive Behavior in the Courtroom
Biaoyan Fang | Trevor Cohn | Timothy Baldwin | Lea Frermann

Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics: Student Research Workshop
Dongfang Li | Rahmad Mahendra | Zilu Peter Tang | Hyeju Jang | Yugo Murawaki | Derek Fai Wong

Cross-lingual Transfer Learning for Javanese Dependency Parsing
Fadli Aulawi Al Ghiffari | Ika Alfina | Kurniawati Azizah

An Investigation of Warning Erroneous Chat Translations in Cross-lingual Communication
Yunmeng Li | Jun Suzuki | Makoto Morishita | Kaori Abe | Kentaro Inui

Gender Inflected or Bias Inflicted: On Using Grammatical Gender Cues for Bias Evaluation in Machine Translation
Pushpdeep Singh

Exploring Automatic Evaluation Methods based on a Decoder-based LLM for Text Generation
Tomohito Kasahara | Daisuke Kawahara

Style-sensitive Sentence Embeddings for Evaluating Similarity in Speech Style of Japanese Sentences by Contrastive Learning
Yuki Zenimoto | Shinzan Komata | Takehito Utsuro

Intermediate-Task Transfer Learning for Peer Review Score Prediction
Panitan Muangkammuen | Fumiyo Fukumoto | Jiyi Li | Yoshimi Suzuki

Speech Synthesis Model Based on Face Landmarks
Chenji Jin | Yoshimi Suzuki | Fei Lin

Rethinking Response Evaluation from Interlocutor’s Eye for Open-Domain Dialogue Systems
Yuma Tsuta | Naoki Yoshinaga | Shoetsu Sato | Masashi Toyoda

Long-form Simultaneous Speech Translation: Thesis Proposal
Peter Polák

Modeling Collaborative Dialogue in Minecraft with Action-Utterance Model
Takuma Ichikawa | Ryuichiro Higashinaka

Graph-Enriched Biomedical Language Models: A Research Proposal
Andrey Sakhovskiy | Alexander Panchenko | Elena Tutubalina

Evaluating Large Language Models’ Understanding of Financial Terminology via Definition Modeling
James Jhirad | Edison Marrese-Taylor | Yutaka Matsuo

Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics: Tutorial Abstract
Yun-Nung (Vivian) Chen | Sadao Kurohashi

Language and Robotics: Toward Building Robots Coexisting with Human Society Using Language Interface
Yutaka Nakamura | Shuhei Kurita | Koichiro Yoshino

Current Status of NLP in South East Asia with Insights from Multilingualism and Language Diversity
Alham Fikri Aji | Jessica Zosa Forde | Alyssa Marie Loo | Lintang Sutawika | Skyler Wang | Genta Indra Winata | Zheng-Xin Yong | Ruochen Zhang | A. Seza Doğruöz | Yin Lin Tan | Jan Christian Blaise Cruz

Practical Tools from Domain Adaptation for Designing Inclusive, Equitable, and Robust Generative AI
Anthony Sicilia | Malihe Alikhani

Editing Large Language Models
Ningyu Zhang | Yunzhi Yao | Shumin Deng

Learning WHO Saying WHAT to WHOM in Multi-Party Conversations
Jia-Chen Gu | Zhuosheng Zhang | Zhen-Hua Ling

Developing State-Of-The-Art Massively Multilingual Machine Translation Systems for Related Languages
Jay Gala | Pranjal A. Chitale | Raj Dabre

Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics: System Demonstrations
Sriparna Saha | Herry Sujaini

TRAVID: An End-to-End Video Translation Framework
Prottay Kumar Adhikary | Bandaru Sugandhi | Subhojit Ghimire | Santanu Pal | Partha Pakray

CustodiAI: A System for Predicting Child Custody Outcomes
Yining Juan | Chung-Chi Chen | Hsin-Hsi Chen | Daw-Wei Wang

Turning Whisper into Real-Time Transcription System
Dominik Macháček | Raj Dabre | Ondřej Bojar

LambdaKG: A Library for Pre-trained Language Model-Based Knowledge Graph Embeddings
Xin Xie | Zhoubo Li | Xiaohan Wang | ZeKun Xi | Ningyu Zhang

mahaNLP: A Marathi Natural Language Processing Library
Vidula Magdum | Omkar Jayant Dhekane | Sharayu Sandeep Hiwarkhedkar | Saloni Sunil Mittal | Raviraj Joshi

SAINE: Scientific Annotation and Inference Engine of Scientific Research
Susie Xi Rao | Yilei Tu | Peter H. Egger

IFAN: An Explainability-Focused Interaction Framework for Humans and NLP Models
Edoardo Mosca | Daryna Dementieva | Tohid Ebrahim Ajdari | Maximilian Kummeth | Kirill Gringauz | Yutong Zhou | Georg Groh

WAMP: Writing, Annotation, and Marking Platform
Geonsik Moon | Muhammad Reza Qorib | Daniel Dahlmeier | Hwee Tou Ng

ERNIE-Music: Text-to-Waveform Music Generation with Diffusion Models
Pengfei Zhu | Chao Pang | Yekun Chai | Lei Li | Shuohuan Wang | Yu Sun | Hao Tian | Hua Wu

Findings of the Association for Computational Linguistics: IJCNLP-AACL 2023 (Findings)
Jong C. Park | Yuki Arase | Baotian Hu | Wei Lu | Derry Wijaya | Ayu Purwarianti | Adila Alfa Krisnadhi

Localize, Retrieve and Fuse: A Generalized Framework for Free-Form Question Answering over Tables
Wenting Zhao | Ye Liu | Yao Wan | Yibo Wang | Zhongfen Deng | Philip S. Yu

Named Entity Recognition via Machine Reading Comprehension: A Multi-Task Learning Approach
Yibo Wang | Wenting Zhao | Yao Wan | Zhongfen Deng | Philip Yu

SPEC5G: A Dataset for 5G Cellular Network Protocol Analysis
Imtiaz Karim | Kazi Samin Mubasshir | Mirza Masfiqur Rahman | Elisa Bertino

PRiSM: Enhancing Low-Resource Document-Level Relation Extraction with Relation-Aware Score Calibration
Minseok Choi | Hyesu Lim | Jaegul Choo

Improving Query-Focused Meeting Summarization with Query-Relevant Knowledge
Tiezheng Yu | Ziwei Ji | Pascale Fung

Learning to Diversify Neural Text Generation via Degenerative Model
Jimin Hong | ChaeHun Park | Jaegul Choo

A Neighbourhood-Aware Differential Privacy Mechanism for Static Word Embeddings
Danushka Bollegala | Shuichi Otake | Tomoya Machide | Ken-ichi Kawarabayashi

PhraseSumm: Abstractive Short Phrase Summarization
Kasturi Bhattacharjee | Kathleen McKeown | Rashmi Gangadharaiah

Location Aware Modular Biencoder for Tourism Question Answering
Haonan Li | Martin Tomko | Timothy Baldwin

Joint Representations of Text and Knowledge Graphs for Retrieval and Evaluation
Teven Le Scao | Claire Gardent

Unsupervised Multi-document Summarization with Holistic Inference
Haopeng Zhang | Sangwoo Cho | Kaiqiang Song | Xiaoyang Wang | Hongwei Wang | Jiawei Zhang | Dong Yu

Predicting Terms in IS-A Relations with Pre-trained Transformers
Irina Nikishina | Polina Chernomorchenko | Anastasiia Demidova | Alexander Panchenko | Chris Biemann

Context Helps Determine Spatial Knowledge from Tweets
Zhaomin Xiao | Yan Huang | Eduardo Blanco

Multilingual Non-Autoregressive Machine Translation without Knowledge Distillation
Chenyang Huang | Fei Huang | Zaixiang Zheng | Osmar Zaïane | Hao Zhou | Lili Mou

Knowledge Injection with Perturbation-based Constrained Attention Network for Word Sense Disambiguation
Fumiyo Fukumoto | Shou Asakawa

The Glass Ceiling of Automatic Evaluation in Natural Language Generation
Pierre Colombo | Maxime Peyrard | Nathan Noiry | Robert West | Pablo Piantanida

A Novel Information Theoretic Objective to Disentangle Representations for Fair Classification
Pierre Colombo | Nathan Noiry | Guillaume Staerman | Pablo Piantanida

Large Language Models and Low-Resource Languages: An Examination of Armenian NLP
Hayastan Avetisyan | David Broneske

Multi-Target Semantic Parsing with Collaborative Deliberation Network
Xiang Li | Fangyu Lei | Shizhu He | Kang Liu | Jun Zhao

Improving Machine Reading Comprehension through A Simple Masked-Training Scheme
Xun Yao | Junlong Ma | Xinrong Hu | Jie Yang | Yuan-Fang Li

A Comprehensive Neural and Behavioral Task Taxonomy Method for Transfer Learning in NLP
Yunhao Zhang | Chong Li | Xiaohan Zhang | Xinyi Dong | Shaonan Wang

My Boli: Code-mixed Marathi-English Corpora, Pretrained Language Models and Evaluation Benchmarks
Tanmay Chavan | Omkar Gokhale | Aditya Kane | Shantanu Patankar | Raviraj Joshi

Template Filling for Controllable Commonsense Reasoning
Dheeraj Rajagopal | Vivek Khetan | Bogdan Sacaleanu | Anatole Gershman | Andrew E. Fano | Eduard Hovy

Temporal Relation Classification in Hebrew
Guy Yanko | Shahaf Pariente | Kfir Bar

Privacy Adhering Machine Un-learning in NLP
Vinayshekhar Bannihatti Kumar | Rashmi Gangadharaiah | Dan Roth

GECTurk: Grammatical Error Correction and Detection Dataset for Turkish
Atakan Kara | Farrin Marouf Sofian | Andrew Bond | Gözde Şahin

Interactively Learning Social Media Representations Improves News Source Factuality Detection
Nikhil Mehta | Dan Goldwasser

IndIE: A Multilingual Open Information Extraction Tool For Indic Languages
Ritwik Mishra | Simranjeet Singh | Rajiv Ratn Shah | Ponnurangam Kumaraguru | Pushpak Bhattacharyya

Mitigating Word Bias in Zero-shot Prompt-based Classifiers
Adian Liusie | Potsawee Manakul | Mark Gales

Mixing It Up: Inducing Empathy and Politeness using Multiple Behaviour-aware Generators for Conversational Systems
Mauajama Firdaus | Priyanshu Priya | Asif Ekbal

Few-Shot Adaptation for Parsing Contextual Utterances with LLMs
Kevin Lin | Patrick Xia | Hao Fang

Exploring the Use of Large Language Models for Reference-Free Text Quality Evaluation: An Empirical Study
Yi Chen | Rui Wang | Haiyun Jiang | Shuming Shi | Ruifeng Xu

A Text-to-Text Model for Multilingual Offensive Language Identification
Tharindu Ranasinghe | Marcos Zampieri

Few-shot Named Entity Recognition with Supported and Dependent Label Representations
Yasuhide Miura | Takumi Takahashi

What Learned Representations and Influence Functions Can Tell Us About Adversarial Examples
Shakila Mahjabin Tonni | Mark Dras

Supervised Clustering Loss for Clustering-Friendly Sentence Embeddings: an Application to Intent Clustering
Giorgio Barnabò | Antonio Uva | Sandro Pollastrini | Chiara Rubagotti | Davide Bernardi

STRONG – Structure Controllable Legal Opinion Summary Generation
Yang Zhong | Diane Litman

Proceedings of the ART of Safety: Workshop on Adversarial testing and Red-Teaming for generative AI
Alicia Parrish

Red Teaming for Large Language Models At Scale: Tackling Hallucinations on Mathematics Tasks
Aleksander Buszydlik | Karol Dobiczek | Michał Teodor Okoń | Konrad Skublicki | Philip Lippmann | Jie Yang

Student-Teacher Prompting for Red Teaming to Improve Guardrails
Rodrigo Revilla Llaca | Victoria Leskoschek | Vitor Costa Paiva | Cătălin Lupău | Philip Lippmann | Jie Yang

Distilling Adversarial Prompts from Safety Benchmarks: Report for the Adversarial Nibbler Challenge
Manuel Brack | Patrick Schramowski | Kristian Kersting

Measuring Adversarial Datasets
Yuanchen Bai | Raoyi Huang | Vijay Viswanathan | Tzu-Sheng Kuo | Tongshuang Wu

Discovering Safety Issues in Text-to-Image Models: Insights from Adversarial Nibbler Challenge
Gauri Sharma

Uncovering Bias in AI-Generated Images
Kimberley Baxter

Proceedings of the 4th Workshop on Evaluation and Comparison of NLP Systems
Daniel Deutsch | Rotem Dror | Steffen Eger | Yang Gao | Christoph Leiter | Juri Opitz | Andreas Rücklé

WRF: Weighted Rouge-F1 Metric for Entity Recognition
Lukas Weber | Krishnan Jothi Ramalingam | Matthias Beyer | Axel Zimmermann

Continuous progress in Named Entity Recognition (NER) allows the identification of complex entities in multiple domains. Traditionally used metrics such as precision, recall, and F1-score reflect the classification quality of the underlying NER model only to a limited extent. Existing metrics do not distinguish between the non-recognition of an entity and the misclassification of an entity. Additionally, the handling of redundant entities remains unaddressed. We propose WRF, a Weighted Rouge F1 metric for Entity Recognition, to address these gaps in currently available metrics. We successfully employ the WRF metric for automotive entity recognition, followed by a comprehensive qualitative and quantitative analysis of the obtained results.

Assessing Distractors in Multiple-Choice Tests
Vatsal Raina | Adian Liusie | Mark Gales

Multiple-choice tests are a common approach for assessing candidates’ comprehension skills. Standard multiple-choice reading comprehension exams require candidates to select the correct answer option from a discrete set, based on a question about a contextual passage. For appropriate assessment, the distractor answer options must by definition be incorrect but plausible and diverse. However, generating good-quality distractors that satisfy these criteria is a challenging task for content creators. We propose automated assessment metrics for the quality of distractors in multiple-choice reading comprehension tests. Specifically, we define quality in terms of the incorrectness, plausibility and diversity of the distractor options. We assess incorrectness using the classification ability of a binary multiple-choice reading comprehension system. Plausibility is assessed by considering the distractor confidence: the probability mass associated with the distractor options for a standard multi-class multiple-choice reading comprehension system. Diversity is assessed by pairwise comparison of an embedding-based equivalence metric between the distractors of a question. To further validate the plausibility metric, we compare against candidate distributions over multiple-choice questions and agreement with a ChatGPT model’s interpretation of distractor plausibility and diversity.
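
As a rough illustration of the plausibility and diversity measures described in this abstract, the sketch below computes a distractor confidence and a pairwise diversity score. It is not the authors' implementation: the function names are hypothetical, the option probabilities and embeddings are assumed to come from some external model, and using one minus mean cosine similarity as the "embedding-based equivalence metric" is an assumption, since the abstract does not specify the metric.

```python
import math
from itertools import combinations


def distractor_confidence(option_probs, correct_index):
    """Probability mass a multi-class system places on the distractor options.

    option_probs: probabilities over all answer options (assumed to sum to 1).
    correct_index: position of the correct answer in option_probs.
    """
    return sum(p for i, p in enumerate(option_probs) if i != correct_index)


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def distractor_diversity(distractor_embeddings):
    """One minus the mean pairwise cosine similarity between distractors.

    ASSUMPTION: cosine similarity stands in for the paper's equivalence
    metric. Requires at least two distractor embeddings.
    """
    pairs = list(combinations(distractor_embeddings, 2))
    mean_similarity = sum(cosine_similarity(u, v) for u, v in pairs) / len(pairs)
    return 1.0 - mean_similarity
```

Under these assumptions, a higher distractor confidence means the model finds the incorrect options more tempting, and a diversity score near zero means the distractors are close to paraphrases of each other.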

Delving into Evaluation Metrics for Generation: A Thorough Assessment of How Metrics Generalize to Rephrasing Across Languages
Yixuan Wang | Qingyan Chen | Duygu Ataman

Language generation has been an important task in natural language processing (NLP), with an increasing variety of applications, especially in recent years. The evaluation of generative language models typically relies on automatic heuristics which search for overlaps in word- or phrase-level patterns between generated outputs and hand-crafted references in the given language, which range in form from sentences to entire documents. Language, on the other hand, is productive by nature, which means the same concept can potentially be expressed in many different lexical or phrasal forms, making the assessment of generated outputs very difficult. Many studies have indicated potential hazards related to the prominent choice of heuristics matching generated language to selected references and the limitations raised by this setting in developing robust generative models. This paper undertakes an in-depth analysis of evaluation metrics used for generative models, specifically investigating their responsiveness to various syntactic structures, and how these characteristics vary across languages with different morphosyntactic typologies. Preliminary findings indicate that while certain metrics exhibit robustness in particular linguistic contexts, a discernible variance emerges in their performance across distinct syntactic forms. Through this exploration, we highlight the imperative need for more nuanced and encompassing evaluation strategies in generative models, advocating for metrics that are sensitive to the multifaceted nature of languages.

EduQuick: A Dataset Toward Evaluating Summarization of Informal Educational Content for Social Media
Zahra Kolagar | Sebastian Steindl | Alessandra Zarcone

This study explores the capacity of large language models (LLMs) to efficiently generate summaries of informal educational content tailored for platforms like TikTok. It also investigates how both humans and LLMs assess the quality of these summaries, based on a series of experiments, exploring the potential replacement of human evaluation with LLMs. Furthermore, the study delves into how experienced content creators perceive the utility of automatic summaries for TikTok videos. We employ strategic prompt selection techniques to guide LLMs in producing engaging summaries based on the characteristics of viral TikTok content, including hashtags, captivating hooks, storytelling, and user engagement. The study leverages OpenAI’s GPT-4 model to generate TikTok content summaries, aiming to align them with the essential features identified. By employing this model and incorporating human evaluation and expert assessment, this research endeavors to shed light on the intricate dynamics of modern content creation, where AI and human ingenuity converge. Ultimately, it seeks to enhance strategies for disseminating and evaluating educational information effectively in the realm of social media.

Zero-shot Probing of Pretrained Language Models for Geography Knowledge
Nitin Ramrakhiyani | Vasudeva Varma | Girish Palshikar | Sachin Pawar

Gauging the knowledge of Pretrained Language Models (PLMs) about facts in niche domains is an important step towards making them better in those domains. In this paper, we aim at evaluating multiple PLMs for their knowledge about world Geography. We contribute (i) a sufficiently sized dataset of masked Geography sentences to probe PLMs on masked token prediction and generation tasks, and (ii) a benchmark of the performance of multiple PLMs on the dataset. We also provide a detailed analysis of the performance of the PLMs on different Geography facts.

Transformers Go for the LOLs: Generating (Humourous) Titles from Scientific Abstracts End-to-End
Yanran Chen | Steffen Eger

We consider the end-to-end abstract-to-title generation problem, exploring seven recent transformer-based models (including ChatGPT) fine-tuned on more than 30k abstract-title pairs from NLP and machine learning (ML) venues. As an extension, we also consider the harder problem of generating humorous paper titles. For the latter, we compile the first large-scale humor-annotated dataset for scientific papers in the NLP/ML domains, comprising 2.6k titles. We evaluate all models using human and automatic metrics. Our human evaluation suggests that our best end-to-end system performs similarly to human authors (but arguably slightly worse). Generating funny titles is more difficult, however, and our automatic systems clearly underperform relative to humans and often learn dataset artefacts of humor. Finally, ChatGPT, without any fine-tuning, performs on the level of our best fine-tuned system.

Summary Cycles: Exploring the Impact of Prompt Engineering on Large Language Models’ Interaction with Interaction Log Information
Jeremy Block | Yu-Peng Chen | Abhilash Budharapu | Lisa Anthony | Bonnie Dorr

With the aim of improving work efficiency, we examine how Large Language Models (LLMs) can better support the handoff of information by summarizing user interactions in collaborative intelligence analysis communication. We experiment with interaction logs, or a record of user interactions with a system. Inspired by chain-of-thought prompting, we describe a technique to avoid API token limits with recursive summarization requests. We then apply ChatGPT over multiple iterations to extract named entities, topics, and summaries, combined with interaction sequence sentences, to generate summaries of critical events and results of analysis sessions. We quantitatively evaluate the generated summaries against human-generated ones using common accuracy metrics (e.g., ROUGE-L, BLEU, BLEURT, and TER). We also report qualitative trends and the factuality of the output. We find that manipulating the audience feature or providing single-shot examples minimally influences the model’s accuracy. While our methodology successfully summarizes interaction logs, the lack of significant results raises questions about prompt engineering and summarization effectiveness generally. We call on explainable artificial intelligence research to better understand how terms and their placement may change LLM outputs, striving for more consistent prompt engineering guidelines.
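The recursive summarization idea mentioned above can be sketched as follows. This is only an illustration under stated assumptions: `toy_summarize` is a hypothetical stand-in for an LLM request, and the limit is expressed as a number of entries per request rather than real API tokens, which the abstract does not detail.

```python
def chunk(items, size):
    """Split a list into consecutive groups of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def recursive_summarize(entries, summarize, max_entries=4):
    """Summarize groups of entries, then summarize the summaries,
    until everything fits into a single request."""
    if max_entries < 2:
        raise ValueError("max_entries must be at least 2 so each pass shrinks the input")
    if not entries:
        return ""
    level = list(entries)
    while len(level) > max_entries:
        level = [summarize(group) for group in chunk(level, max_entries)]
    return summarize(level)


def toy_summarize(texts):
    """Hypothetical stand-in for an LLM call: keep the first word of each text."""
    return " ".join(text.split()[0] for text in texts)


# Ten made-up interaction log entries, too many for one request of four entries.
log_entries = [f"e{i} user opened document {i}" for i in range(10)]
print(recursive_summarize(log_entries, toy_summarize))
```

Swapping `toy_summarize` for a real model call, and counting tokens instead of entries, would bring the sketch closer to a practical setup.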

Large Language Models As Annotators: A Preliminary Evaluation For Annotating Low-Resource Language Content
Savita Bhat | Vasudeva Varma

The process of collecting human-generated annotations is time-consuming and resource-hungry. In the case of low-resource (LR) languages such as Indic languages, these efforts are more expensive due to the dearth of data and human experts. Considering their importance in solving downstream applications, there have been concentrated efforts exploring alternatives for human-generated annotations. To that end, we seek to evaluate multilingual large language models (LLMs) for their potential to substitute or aid human-generated annotation efforts. We use LLMs to re-label publicly available datasets in LR languages for the tasks of natural language inference, sentiment analysis, and news classification. We compare these annotations with existing ground truth labels to analyze the efficacy of using LLMs for annotation tasks. We observe that the performance of these LLMs varies substantially across different tasks and languages. The results show that off-the-shelf use of multilingual LLMs is not appropriate and results in poor performance in two of the three tasks.

Can a Prediction’s Rank Offer a More Accurate Quantification of Bias? A Case Study Measuring Sexism in Debiased Language Models
Jad Doughman | Shady Shehata | Leen Al Qadi | Youssef Nafea | Fakhri Karray

Pre-trained language models are known to inherit a plethora of contextual biases from their training data. These biases have proven to be projected onto a variety of downstream applications, making their detection and mitigation imperative. Limited research has been conducted to quantify specific bias types, such as benevolent sexism, which may be subtly present within the inferred connotations of a sentence. To this end, our work aims to: (1) provide a benchmark of sexism sentences; (2) adapt two bias metrics: mean probability score and mean normalized rank; (3) conduct a case study to quantify and analyze sexism in base and de-biased masked language models. We find that debiasing, even in its most effective form (Auto-Debias), solely nullifies the probability score of biasing tokens, while retaining them in high ranks. Auto-Debias illustrates a 90%-96% reduction in mean probability scores from base to debiased models, while only a 3%-16% reduction in mean normalized ranks. Similar to the application of non-parametric statistical tests for data that does not follow a normal distribution, operating on the ranks of predictions rather than their probability scores offers a more representative bias measure.
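To make the contrast between probability-based and rank-based measures concrete, here is a minimal sketch. The exact definitions used in the paper are not given in the abstract, so the normalization (rank divided by vocabulary size) is an assumption, and the token distributions are invented toy values rather than results from the paper.

```python
def rank_of(token, dist):
    """1-based rank of `token` when the vocabulary is sorted by descending probability."""
    ordered = sorted(dist, key=dist.get, reverse=True)
    return ordered.index(token) + 1


def mean_probability(dists, targets):
    """Average probability assigned to each target (biasing) token."""
    return sum(d[t] for d, t in zip(dists, targets)) / len(dists)


def mean_normalized_rank(dists, targets):
    """Average of rank / vocabulary size (assumed normalization);
    smaller values mean the token sits nearer the top of the prediction list."""
    return sum(rank_of(t, d) / len(d) for d, t in zip(dists, targets)) / len(dists)


# Hypothetical masked-token distributions over a four-word vocabulary.
base = [{"nurse": 0.60, "doctor": 0.25, "pilot": 0.10, "other": 0.05}]
debiased = [{"nurse": 0.30, "doctor": 0.28, "pilot": 0.22, "other": 0.20}]
targets = ["nurse"]

print(mean_probability(base, targets), mean_probability(debiased, targets))
print(mean_normalized_rank(base, targets), mean_normalized_rank(debiased, targets))
```

In this toy case the probability of the biasing token halves while its rank stays first, which is the kind of gap a rank-based measure is meant to expose.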

The Eval4NLP 2023 Shared Task on Prompting Large Language Models as Explainable Metrics
Christoph Leiter | Juri Opitz | Daniel Deutsch | Yang Gao | Rotem Dror | Steffen Eger

Generative large language models (LLMs) have seen many breakthroughs over the last year. With an increasing number of parameters and pre-training data, they have shown remarkable capabilities to solve tasks with minimal or no task-related examples. Notably, LLMs have been successfully employed as evaluation metrics in text generation tasks. Strategies employed in this context differ in the choice of input prompts, the selection of samples for demonstration, and the methodology used to construct scores grading the generations. Within this context, we introduce the Eval4NLP 2023 shared task that asks participants to explore such approaches for machine translation evaluation and summarization evaluation. Specifically, we select a list of allowed LLMs and disallow fine-tuning to ensure a focus on prompting. We test the approaches of the participants on a new reference-free test-set spanning 3 language pairs for machine translation as well as a summarization dataset. Further, we present an overview of the approaches taken by the participants, present their results on the test set and analyze paths for future work. Finally, as a separate track, we perform a human evaluation of the plausibility of explanations given by the LLMs and its effect on model performance. We make parts of our code and datasets available.

HIT-MI&T Lab’s Submission to Eval4NLP 2023 Shared Task
Rui Zhang | Fuhai Song | Hui Huang | Jinghao Yuan | Muyun Yang | Tiejun Zhao

Recently, Large Language Models (LLMs) have boosted research in natural language processing and shown impressive capabilities across numerous domains, including machine translation evaluation. This paper presents our methods developed for the machine translation evaluation sub-task of the Eval4NLP 2023 Shared Task. Based on the provided LLMs, we propose a generation-based method as well as a probability-based method to perform evaluation, explore different strategies when selecting the demonstrations for in-context learning, and try different ensemble methods to further improve the evaluation accuracy. The experimental results on the development set and test set demonstrate the effectiveness of our proposed method.

Understanding Large Language Model Based Metrics for Text Summarization
Abhishek Pradhan | Ketan Todi

This paper compares the two most widely used techniques for evaluating generative tasks with large language models (LLMs): prompt-based evaluation and log-likelihood evaluation as part of the Eval4NLP shared task. We focus on the summarization task and evaluate both small and large LLMs. We also study the impact of LLAMA and LLAMA 2 on summarization, using the same set of prompts and techniques. We used the Eval4NLP dataset for our comparison. This study provides evidence of the advantages of prompt-based evaluation techniques over log-likelihood based techniques, especially for large models and models with better reasoning power.

LTRC_IIITH’s 2023 Submission for Prompting Large Language Models as Explainable Metrics Task
Pavan Baswani | Ananya Mukherjee | Manish Shrivastava

In this report, we share our contribution to the Eval4NLP Shared Task titled “Prompting Large Language Models as Explainable Metrics.” We build our prompts with a primary focus on effective prompting strategies, score-aggregation, and explainability for LLM-based metrics. We participated in the track for smaller models by submitting the scores along with their explanations. According to the Kendall correlation scores on the leaderboard, our MT evaluation submission ranks second-best, while our summarization evaluation submission ranks fourth, with only a 0.06 difference from the leading submission.

Which is better? Exploring Prompting Strategy For LLM-based Metrics
JoongHoon Kim | Sangmin Lee | Seung Hun Han | Saeran Park | Jiyoon Lee | Kiyoon Jeong | Pilsung Kang

This paper describes the DSBA submissions to the Prompting Large Language Models as Explainable Metrics shared task, where systems were submitted to two tracks: small and large summarization tracks. With advanced Large Language Models (LLMs) such as GPT-4, evaluating the quality of Natural Language Generation (NLG) has become increasingly paramount. Traditional similarity-based metrics such as BLEU and ROUGE have been shown to misalign with human evaluation and are ill-suited for open-ended generation tasks. To address this issue, we explore the potential capability of LLM-based metrics, especially leveraging open-source LLMs. In this study, a wide range of prompts and prompting techniques are systematically analyzed with three approaches: prompting strategy, score aggregation, and explainability. Our research focuses on formulating effective prompt templates, determining the granularity of NLG quality scores and assessing the impact of in-context examples on LLM-based evaluation. Furthermore, three aggregation strategies are compared to identify the most reliable method for aggregating NLG quality scores. To examine explainability, we devise a strategy that generates rationales for the scores and analyzes the characteristics of the explanation produced by the open-source LLMs. Extensive experiments provide insights regarding evaluation capabilities of open-source LLMs and suggest effective prompting strategies.

Characterised LLMs Affect its Evaluation of Summary and Translation
Yu-An Lu | Yu-Ting Lin

With the widespread use of Large Language Models (LLMs) today, there have been significant achievements in various text domains such as generating summaries and translations. However, there is still room for development and improvement in evaluating the outputs of LLMs. In this paper, we propose an innovative scoring system that assesses the quality of summaries and translations using multiple metrics. We also enhance the LLM’s performance in scoring tasks by assigning it different roles, effectively making it act as an expert. We test four roles in the study: a teacher, a proofreader, a travel writer, and an internet troll, comparing the advantages and disadvantages of each role in the scoring task. Our research results demonstrate that emphasizing the LLM’s multilingual capabilities and strict standards as its identity can effectively boost its performance. Additionally, imbuing the LLM with a more critical thinking ability enhances its performance in translation tasks compared to a milder LLM identity. In summary, we show that assigning different identities to an LLM can influence its performance in scoring tasks. We believe that this research will contribute to the use of LLMs for scoring purposes.

Reference-Free Summarization Evaluation with Large Language Models
Abbas Akkasi | Kathleen Fraser | Majid Komeili

With the continuous advancement in unsupervised learning methodologies, text generation has become increasingly pervasive. However, the evaluation of the quality of the generated text remains challenging. Human annotations are expensive and often show high levels of disagreement, in particular for certain tasks characterized by inherent subjectivity, such as translation and summarization. Consequently, the demand for automated metrics that can reliably assess the quality of such generative systems and their outputs has grown more pronounced than ever. In 2023, Eval4NLP organized a shared task dedicated to the automatic evaluation of outputs from two specific categories of generative systems: machine translation and summarization. This evaluation was achieved through the utilization of prompts with Large Language Models. Participating in the summarization evaluation track, we propose an approach that involves prompting LLMs to evaluate six different latent dimensions of summarization quality. In contrast to many previous approaches to summarization assessments, which emphasize lexical overlap with reference text, this method surfaces the importance of correct syntax in summarization evaluation. Our method resulted in the second-highest performance in this shared task, demonstrating its effectiveness as a reference-free evaluation.

Little Giants: Exploring the Potential of Small LLMs as Evaluation Metrics in Summarization in the Eval4NLP 2023 Shared Task
Neema Kotonya | Saran Krishnasamy | Joel Tetreault | Alejandro Jaimes

This paper describes and analyzes our participation in the 2023 Eval4NLP shared task, which focuses on assessing the effectiveness of prompt-based techniques to empower Large Language Models to handle the task of quality estimation, particularly in the context of evaluating machine translations and summaries. We conducted systematic experiments with various prompting techniques, including standard prompting, prompts informed by annotator instructions, and innovative chain-of-thought prompting. In addition, we integrated these approaches with zero-shot and one-shot learning methods to maximize the efficacy of our evaluation procedures. Our work reveals that combining these approaches using a “small”, open source model (orca_mini_v3_7B) yields competitive results.

Exploring Prompting Large Language Models as Explainable Metrics
Ghazaleh Mahmoudi

This paper describes the IUST NLP Lab submission to the Prompting Large Language Models as Explainable Metrics Shared Task at the Eval4NLP 2023 Workshop on Evaluation & Comparison of NLP Systems. We have proposed a zero-shot prompt-based strategy for explainable evaluation of the summarization task using Large Language Models (LLMs). The conducted experiments demonstrate the promising potential of LLMs as evaluation metrics in Natural Language Processing (NLP), particularly in the field of summarization. Both few-shot and zero-shot approaches are employed in these experiments. The performance of our best provided prompts achieved a Kendall correlation of 0.477 with human evaluations in the text summarization task on the test data.

Team NLLG submission for Eval4NLP 2023 Shared Task: Retrieval-Augmented In-Context Learning for NLG Evaluation
Daniil Larionov | Vasiliy Viskov | George Kokush | Alexander Panchenko | Steffen Eger

In this paper, we propose a retrieval-augmented in-context learning method for natural language generation (NLG) evaluation. This method allows practitioners to utilize large language models (LLMs) for various NLG evaluation tasks without any fine-tuning. We apply our approach to the Eval4NLP 2023 Shared Task in the translation evaluation and summarization evaluation subtasks. The findings suggest that retrieval-augmented in-context learning is a promising approach for creating LLM-based evaluation metrics for NLG. Further research directions include exploring the performance of various publicly available LLMs and identifying which LLM properties help boost the quality of the metric.

Proceedings of the Sixth Workshop on Financial Technology and Natural Language Processing
Chung-Chi Chen | Hen-Hsen Huang | Hiroya Takamura | Hsin-Hsi Chen | Hiroki Sakaji | Kiyoshi Izumi

Large Language Model Adaptation for Financial Sentiment Analysis
Pau Rodriguez Inserte | Mariam Nakhlé | Raheel Qader | Gaetan Caillaut | Jingshu Liu

Natural language processing (NLP) has recently gained relevance within financial institutions by providing highly valuable insights into companies and markets’ financial documents. However, the landscape of the financial domain presents extra challenges for NLP, due to the complexity of the texts and the use of specific terminology. Generalist language models tend to fall short in tasks specifically tailored for finance, even when using large language models (LLMs) with great natural language understanding and generative capabilities. This paper presents a study on LLM adaptation methods targeted at the financial domain, with a strong emphasis on financial sentiment analysis. To this end, two foundation models with fewer than 1.5B parameters have been adapted using a wide range of strategies. We show that through careful fine-tuning on both financial documents and instructions, these foundation models can be adapted to the target domain. Moreover, we observe that small LLMs have comparable performance to larger scale models, while being more efficient in terms of parameters and data. In addition to the models, we show how to generate artificial instructions through LLMs to augment the number of samples of the instruction dataset.

From Numbers to Words: Multi-Modal Bankruptcy Prediction Using the ECL Dataset
Henri Arno | Klaas Mulier | Joke Baeck | Thomas Demeester

In this paper, we present ECL, a novel multimodal dataset containing the textual and numerical data from corporate 10K filings and associated binary bankruptcy labels. Furthermore, we develop and critically evaluate several classical and neural bankruptcy prediction models using this dataset. Our findings suggest that the information contained in each data modality is complementary for bankruptcy prediction. We also see that the binary bankruptcy prediction target does not enable our models to distinguish next year bankruptcy from an unhealthy financial situation resulting in bankruptcy in later years. Finally, we explore the use of LLMs in the context of our task. We show how GPT-based models can be used to extract meaningful summaries from the textual data but zero-shot bankruptcy prediction results are poor. All resources required to access and update the dataset or replicate our experiments are available on github.com/henriarnoUG/ECL.

Headline Generation for Stock Price Fluctuation Articles
Shunsuke Nishida | Yuki Zenimoto | Xiaotian Wang | Takuya Tamura | Takehito Utsuro

The purpose of this paper is to construct a model for the generation of sophisticated headlines pertaining to stock price fluctuation articles, derived from the articles’ content. With respect to this headline generation objective, this paper solves three distinct tasks: in addition to the task of generating article headlines, two other tasks of extracting security names and ascertaining the trajectory of stock prices, whether they are rising or declining. Regarding the headline generation task, we also revise the task so that the model utilizes the outcomes of the security name extraction and rise/decline determination tasks, thereby preventing the inclusion of erroneous security names. We employed state-of-the-art pre-trained models from the field of natural language processing, fine-tuning these models for each task to enhance their precision. The dataset utilized for fine-tuning comprises a collection of articles delineating the rise and decline of stock prices. Consequently, we achieved remarkably high accuracy in the dual tasks of security name extraction and stock price rise or decline determination. For the headline generation task, a significant portion of the test data yielded fitting headlines.

Audit Report Coverage Assessment using Sentence Classification
Sushodhan Vaishampayan | Nitin Ramrakhiyani | Sachin Pawar | Aditi Pawde | Manoj Apte | Girish Palshikar

Audit reports are a window to the financial health of a company and hence gauging coverage of various audit aspects in them is important. In this paper, we aim at determining an audit report’s coverage through classification of its sentences into multiple domain-specific classes. In a weakly supervised setting, we employ a rule-based approach to automatically create training data for a BERT-based multi-label classifier. We then devise an ensemble to combine both the rule-based and classifier approaches. Further, we employ two novel ways to improve the ensemble’s generalization: (i) through an active-learning-based approach and (ii) through an LLM-based review. We demonstrate that our proposed approaches outperform several baselines. We show the utility of the proposed approaches to measure audit coverage on a large dataset of 2.8K audit reports.

GPT-FinRE: In-context Learning for Financial Relation Extraction using Large Language Models
Pawan Rajpoot | Ankur Parikh

Relation extraction (RE) is a crucial task in natural language processing (NLP) that aims to identify and classify relationships between entities mentioned in text. In the financial domain, relation extraction plays a vital role in extracting valuable information from financial documents, such as news articles, earnings reports, and company filings. This paper describes our solution to relation extraction on one such dataset, REFinD. The dataset was released along with a shared task as a part of the Fourth Workshop on Knowledge Discovery from Unstructured Data in Financial Services, co-located with SIGIR 2023. In this paper, we employed OpenAI models under the framework of in-context learning (ICL). We utilized two retrieval strategies to find top K relevant in-context learning demonstrations / examples from training data for a given test example. The first retrieval mechanism we employed is a learning-free dense retriever and the other system is a learning-based retriever. We were able to achieve 3rd rank overall. Our best F1-score is 0.718.

Multi-Lingual ESG Impact Type Identification
Chung-Chi Chen | Yu-Min Tseng | Juyeon Kang | Anaïs Lhuissier | Yohei Seki | Min-Yuh Day | Teng-Tsai Tu | Hsin-Hsi Chen

Assessing a company’s sustainable development goes beyond just financial metrics; the inclusion of environmental, social, and governance (ESG) factors is becoming increasingly vital. The ML-ESG shared task series seeks to pioneer discussions on news-driven ESG ratings, drawing inspiration from the MSCI ESG rating guidelines. In its second edition, ML-ESG-2 emphasizes impact type identification, offering datasets in four languages: Chinese, English, French, and Japanese. Of the 28 teams registered, 8 participated in the official evaluation. This paper presents a comprehensive overview of ML-ESG-2, detailing the dataset specifics and summarizing the performance outcomes of the participating teams.

Identifying ESG Impact with Key Information
Le Qiu | Bo Peng | Jinghang Gu | Yu-Yin Hsu | Emmanuele Chersoni

The paper presents a concise summary of our work for the ML-ESG-2 shared task, exclusively on the Chinese and English datasets. ML-ESG-2 aims to ascertain the influence of news articles on corporations, specifically from an ESG perspective. To this end, we generally explored the capability of key information for impact identification and experimented with various techniques at different levels. For instance, we attempted to incorporate important information at the word level with TF-IDF, at the sentence level with TextRank, and at the document level with summarization. The final results reveal that the variant using GPT-4 for summarization yields the best predictions.

A low resource framework for Multi-lingual ESG Impact Type Identification
Harsha Vardhan | Sohom Ghosh | Ponnurangam Kumaraguru | Sudip Naskar

With the growing interest in Green Investing, Environmental, Social, and Governance (ESG) factors related to Institutions and financial entities have become extremely important for investors. While the classification of potential ESG factors is an important issue, identifying whether the factors positively or negatively impact the Institution is also a key aspect to consider while making evaluations for ESG scores. This paper presents our solution to identify ESG impact types in four languages (English, Chinese, Japanese, French) released as shared tasks during the FinNLP workshop at the IJCNLP-AACL-2023 conference. We use a combination of translation, masked language modeling, paraphrasing, and classification to solve this problem and use a generalized pipeline that performs well across all four languages. Our team ranked 1st in the Chinese and Japanese sub-tasks.

GPT-based Solution for ESG Impact Type Identification
Anna Polyanskaya | Lucas Fernández Brillet

In this paper, we present our solutions to the ML-ESG-2 shared task which is co-located with the FinNLP workshop at IJCNLP-AACL-2023. The task proposes an objective of binary classification of ESG-related news based on what type of impact they can have on a company - Risk or Opportunity. We report the results of three systems, which ranked 2nd, 9th, and 10th in the final leaderboard for the English language, with the best solution achieving over 0.97 in F1 score.

The Risk and Opportunity of Data Augmentation and Translation for ESG News Impact Identification with Language Models
Yosef Ardhito Winatmoko | Ali Septiandri

This paper presents our findings in the ML-ESG-2 task, which focused on classifying a news snippet of various languages as “Risk” or “Opportunity” in the ESG (Environmental, Social, and Governance) context. We experimented with data augmentation and translation facilitated by Large Language Models (LLMs). We found that augmenting the English dataset did not help to improve the performance. By fine-tuning RoBERTa models with the original data, we achieved the top position for the English and second place for the French task. In contrast, we could achieve comparable results on the French dataset by solely using the English translation, securing the third position for the French task with only marginal F1 differences to the second-place model.

ESG Impact Type Classification: Leveraging Strategic Prompt Engineering and LLM Fine-Tuning
Soumya Mishra

In this paper, we describe our approach to the ML-ESG-2 shared task, co-located with the FinNLP workshop at IJCNLP-AACL-2023. The task aims at classifying news articles into categories reflecting either “Opportunity” or “Risk” from an ESG standpoint for companies. Our innovative methodology leverages two distinct systems for optimal text classification. In the initial phase, we engage in prompt engineering, working in conjunction with semantic similarity and using the Claude 2 LLM. Subsequently, we apply fine-tuning techniques to the Llama 2 and Dolly LLMs to enhance their performance. We report the results of five different approaches in this paper, with our top models ranking first in the French category and sixth in the English category.

Exploring Knowledge Composition for ESG Impact Type Determination
Fabian Billert | Stefan Conrad

In this paper, we discuss our (Team HHU’s) submission to the Multi-Lingual ESG Impact Type Identification task (ML-ESG-2). The goal of this task is to determine if an ESG-related news article represents an opportunity or a risk. We use an adapter-based framework in order to train multiple adapter modules which capture different parts of the knowledge present in the training data. Experimenting with various Adapter Fusion setups, we focus both on combining the ESG-aspect-specific knowledge and on combining the language-specific knowledge. Our results show that in both cases, it is possible to effectively compose the knowledge in order to improve the impact type determination.

Enhancing ESG Impact Type Identification through Early Fusion and Multilingual Models
Hariram Veeramani | Surendrabikram Thapa | Usman Naseem

In the evolving landscape of Environmental, Social, and Corporate Governance (ESG) impact assessment, the ML-ESG-2 shared task proposes identifying ESG impact types. To address this challenge, we present a comprehensive system leveraging ensemble learning techniques, capitalizing on early and late fusion approaches. Our approach employs four distinct models: mBERT, FlauBERT-base, ALBERT-base-v2, and a Multi-Layer Perceptron (MLP) incorporating Latent Semantic Analysis (LSA) and Term Frequency-Inverse Document Frequency (TF-IDF) features. Through extensive experimentation, we find that our early fusion ensemble approach, featuring the integration of LSA, TF-IDF, mBERT, FlauBERT-base, and ALBERT-base-v2, delivers the best performance. Our system offers a comprehensive ESG impact type identification solution, contributing to the responsible and sustainable decision-making processes vital in today’s financial and corporate governance landscape.
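The abstract above contrasts early and late fusion without giving implementation details. As a rough illustration only, the sketch below shows the general difference between the two: early fusion concatenates feature vectors from several sources before a single classifier sees them, while late fusion combines the outputs of separately trained models. The function names, the toy numbers, the “Risk”/“Opportunity” labels, and the choice of probability averaging as the late-fusion rule are all assumptions made for this example and are not taken from the paper.

```python
from typing import Dict, List, Sequence


def early_fusion(feature_blocks: Sequence[Sequence[float]]) -> List[float]:
    """Concatenate per-source feature vectors into one input vector.

    A single downstream classifier would then be trained on the result.
    """
    fused: List[float] = []
    for block in feature_blocks:
        fused.extend(block)
    return fused


def late_fusion(prob_dists: Sequence[Dict[str, float]]) -> Dict[str, float]:
    """Average per-model class probabilities.

    Averaging is one common late-fusion rule, assumed here for illustration.
    """
    labels = list(prob_dists[0].keys())
    n = len(prob_dists)
    return {label: sum(d[label] for d in prob_dists) / n for label in labels}


# Toy, made-up feature vectors standing in for real model outputs.
tfidf_lsa_features = [0.2, 0.7]      # hypothetical TF-IDF + LSA features
mbert_features = [0.1, -0.3, 0.5]    # hypothetical mBERT sentence embedding
flaubert_features = [0.4, 0.0]       # hypothetical FlauBERT sentence embedding

# Early fusion: one 7-dimensional vector for a single classifier.
fused = early_fusion([tfidf_lsa_features, mbert_features, flaubert_features])

# Late fusion: each model predicts on its own, then predictions are combined.
model_probs = [
    {"Risk": 0.8, "Opportunity": 0.2},
    {"Risk": 0.4, "Opportunity": 0.6},
]
avg = late_fusion(model_probs)
prediction = max(avg, key=avg.get)
```

In this toy setup, early fusion lets one classifier learn interactions across all feature sources, whereas late fusion keeps the models independent and only merges their decisions.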

Proceedings of the First Workshop in South East Asian Language Processing

Proceedings of the First Workshop in South East Asian Language Processing
Derry Wijaya | Alham Fikri Aji | Clara Vania | Genta Indra Winata | Ayu Purwarianti

Towards Automatic Construction of Filipino WordNet: Word Sense Induction and Synset Induction Using Sentence Embeddings
Dan John Velasco | Axel Alba | Trisha Gail Pelagio | Bryce Anthony Ramirez | Jan Christian Blaise Cruz | Unisse Chua | Briane Paul Samson | Charibeth Cheng

Developing a Named Entity Recognition Dataset for Tagalog
Lester James V. Miranda

Balarila: Deep Learning for Semantic Grammar Error Correction in Low-Resource Settings
Paolo Espiritu | Joshue Jadie | Andre Ponce | Charibeth Cheng

Utilizing Weak Supervision to Generate Indonesian Conservation Datasets
Mega Fransiska | Diah Pitaloka | Saripudin | Satrio Putra | Lintang Sutawika

InstructAlign: High-and-Low Resource Language Alignment via Continual Crosslingual Instruction Tuning
Samuel Cahyawijaya | Holy Lovenia | Tiezheng Yu | Willy Chung | Pascale Fung

SentMix-3L: A Novel Code-Mixed Test Dataset in Bangla-English-Hindi for Sentiment Analysis
Md Nishat Raihan | Dhiman Goswami | Antara Mahmud | Antonios Anastasopoulos | Marcos Zampieri

IndoToD: A Multi-Domain Indonesian Benchmark For End-to-End Task-Oriented Dialogue Systems
Muhammad Kautsar | Rahmah Nurdini | Samuel Cahyawijaya | Genta Winata | Ayu Purwarianti

Replicable Benchmarking of Neural Machine Translation (NMT) on Low-Resource Local Languages in Indonesia
Lucky Susanto | Ryandito Diandaru | Adila Krisnadhi | Ayu Purwarianti | Derry Tanti Wijaya

Proceedings of the 11th International Workshop on Natural Language Processing for Social Media

Proceedings of the 11th International Workshop on Natural Language Processing for Social Media
Lun-Wei Ku | Cheng-Te Li

Temporal Tides of Emotional Resonance: A Novel Approach to Identify Mental Health on Social Media
Usman Naseem | Surendrabikram Thapa | Qi Zhang | Junaid Rashid | Liang Hu | Mehwish Nasim

Efficient Sentiment Analysis: A Resource-Aware Evaluation of Feature Extraction Techniques, Ensembling, and Deep Learning Models
Mahammed Kamruzzaman | Gene Kim

OffMix-3L: A Novel Code-Mixed Test Dataset in Bangla-English-Hindi for Offensive Language Identification
Dhiman Goswami | Md Nishat Raihan | Antara Mahmud | Antonios Anastasopoulos | Marcos Zampieri

An Emotion-Enriched and Psycholinguistics Features-Based Approach for Rumor Detection on Online Social Media
Asimul Haque | Muhammad Abulaish

The Future of Meat: Sentiment Analysis of Food Tweets
Matiss Rikters | Maija Kāle

Boosting Adverse Drug Event Normalization on Social Media: General-Purpose Model Initialization and Biomedical Semantic Text Similarity Benefit Zero-Shot Linking in Informal Contexts
François Remy | Simone Scaboro | Beatrice Portelli

Proceedings of the Second Workshop on Information Extraction from Scientific Publications

Proceedings of the Second Workshop on Information Extraction from Scientific Publications
Tirthankar Ghosal | Felix Grezes | Thomas Allen | Kelly Lockhart | Alberto Accomazzi | Sergi Blanco-Cuaresma

Investigating the Impact of Syntax-Enriched Transformers on Quantity Extraction in Scientific Texts
Necva Bölücü | Maciej Rybinski | Stephen Wan

NanoNER: Named Entity Recognition for Nanobiology Using Experts’ Knowledge and Distant Supervision
Ran Cheng | Martin Lentschat | Cyril Labbe

Relation Extraction from Scientific Texts in Russian with Limited Training Data
Olga Tikhobaeva | Elena Bruches

Extracting Definienda in Mathematical Scholarly Articles with Transformers
Shufan Jiang | Pierre Senellart

A Novel Dataset Towards Extracting Virus-Host Interactions
Rasha R. Alshawi | Atriya Sen | Nathan S. Upham | Beckett Sterner

Detection of Tortured Phrases in Scientific Literature
Eléna Martel | Martin Lentschat | Cyril Labbe

LaTeX Rainbow: Universal LaTeX to PDF Document Semantic & Layout Annotation Framework
Changxu Duan | Zhiyin Tan | Sabine Bartsch

Leveraging the Fusion-in-Decoder for Label Classification
Azumi Okuda | Hideya Mino | Taro Miyazaki | Jun Goto

Enhancing Academic Title Generation Using SciBERT and Linguistic Rules
Elena Callegari | Peter Vajdecka | Desara Xhura | Anton Karl Ingason

MuLMS: A Multi-Layer Annotated Text Corpus for Information Extraction in the Materials Science Domain
Timo Pierre Schrader | Matteo Finco | Stefan Grünewald | Felix Hildebrand | Annemarie Friedrich

An End-to-End Pipeline for Bibliography Extraction from Scientific Articles
Bikash Joshi | Anthi Symeonidou | Syed Mazin Danish | Floris Hermsen

Factored Verification: Detecting and Reducing Hallucination in Summaries of Academic Papers
Charlie George | Andreas Stuhlmueller

APCS: Towards Argument Based Pros and Cons Summarization of Peer Reviews
Sandeep Kumar | Tirthankar Ghosal | Asif Ekbal

On the Use of Language Models for Function Identification of Citations in Scholarly Papers
Tomoki Ikoma | Shigeki Matsubara

Automated Citation Function Classification and Context Extraction in Astrophysics: Leveraging Paraphrasing and Question Answering
Hariram Veeramani | Surendrabikram Thapa | Usman Naseem

Function of Citation in Astrophysics Literature (FOCAL): Findings of the Shared Task
Felix Grezes | Thomas Allen | Tirthankar Ghosal | Sergi Blanco-Cuaresma