Beyond One-Size-Fits-All: Inversion Learning for Highly Effective NLG Evaluation Prompts

Hanhua Hong; Chenghao Xiao; Yang Wang; Yiqi Liu; Wenge Rong; Chenghua Lin

doi:10.1162/tacl.a.617

Beyond One-Size-Fits-All: Inversion Learning for Highly Effective NLG Evaluation Prompts

Hanhua Hong, Chenghao Xiao, Yang Wang, Yiqi Liu, Wenge Rong, Chenghua Lin

Correct Metadata for

Use this form to create a GitHub issue with structured data describing the correction. You will need a GitHub account. Once you create that issue, the correction will be reviewed by a staff member.

⚠️ Mobile Users: Submitting this form to create a new issue will only work with github.com, not the GitHub Mobile app.

Important: The Anthology treat PDFs as authoritative. Please use this form only to correct data that is out of line with the PDF. See our corrections guidelines if you need to change the PDF.

Title Adjust the title. Retain tags such as <fixed-case>.

Authors Adjust author names and order to match the PDF.

Abstract Correct abstract if needed. Retain XML formatting tags such as <tex-math>. You may use <b>...</b> for bold, <i>...</i> for italic, and <url>...</url> for URLs.

Verification against PDF Ensure that the new title/authors match the snapshot below. (If there is no snapshot or it is too small, consult the PDF.)

Authors concatenated from the text boxes above:

ALL author names match the snapshot above—including middle initials, hyphens, and accents.

Abstract

Evaluating natural language generation systems is challenging due to the diversity of valid outputs. While human evaluation is the gold standard, it suffers from inconsistencies, lack of standardization, and demographic biases, limiting reproducibility. LLM-based evaluators offer a scalable alternative but are highly sensitive to prompt design, where small variations can lead to significant discrepancies. In this work, we propose an inversion learning method that learns effective reverse mappings from model outputs back to their input instructions, enabling the automatic generation of highly effective, model-specific evaluation prompts. Our method requires only a single evaluation sample and eliminates the need for time-consuming manual prompt engineering, thereby improving both efficiency and robustness. Our work contributes toward a new direction for more robust and efficient LLM-based evaluation.

Anthology ID:: 2026.tacl-1.31
Volume:: Transactions of the Association for Computational Linguistics, Volume 14
Month:
Year:: 2026
Address:: Cambridge, MA
Venue:: TACL
SIG:
Publisher:: MIT Press
Note:
Pages:: 689–710
Language:
URL:: https://aclanthology.org/2026.tacl-1.31/
DOI:: 10.1162/tacl.a.617
Bibkey:
Cite (ACL):: Hanhua Hong, Chenghao Xiao, Yang Wang, Yiqi Liu, Wenge Rong, and Chenghua Lin. 2026. Beyond One-Size-Fits-All: Inversion Learning for Highly Effective NLG Evaluation Prompts. Transactions of the Association for Computational Linguistics, 14:689–710.
Cite (Informal):: Beyond One-Size-Fits-All: Inversion Learning for Highly Effective NLG Evaluation Prompts (Hong et al., TACL 2026)
Copy Citation:
PDF:: https://aclanthology.org/2026.tacl-1.31.pdf

PDF Cite Search Fix data

Export citation

BibTeX
MODS XML
Endnote
Preformatted

@article{hong-etal-2026-beyond,
    title = "Beyond One-Size-Fits-All: Inversion Learning for Highly Effective {NLG} Evaluation Prompts",
    author = "Hong, Hanhua  and
      Xiao, Chenghao  and
      Wang, Yang  and
      Liu, Yiqi  and
      Rong, Wenge  and
      Lin, Chenghua",
    journal = "Transactions of the Association for Computational Linguistics",
    volume = "14",
    year = "2026",
    address = "Cambridge, MA",
    publisher = "MIT Press",
    url = "https://aclanthology.org/2026.tacl-1.31/",
    doi = "10.1162/tacl.a.617",
    pages = "689--710",
    abstract = "Evaluating natural language generation systems is challenging due to the diversity of valid outputs. While human evaluation is the gold standard, it suffers from inconsistencies, lack of standardization, and demographic biases, limiting reproducibility. LLM-based evaluators offer a scalable alternative but are highly sensitive to prompt design, where small variations can lead to significant discrepancies. In this work, we propose an inversion learning method that learns effective reverse mappings from model outputs back to their input instructions, enabling the automatic generation of highly effective, model-specific evaluation prompts. Our method requires only a single evaluation sample and eliminates the need for time-consuming manual prompt engineering, thereby improving both efficiency and robustness. Our work contributes toward a new direction for more robust and efficient LLM-based evaluation."
}

Download as File

<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="hong-etal-2026-beyond">
    <titleInfo>
        <title>Beyond One-Size-Fits-All: Inversion Learning for Highly Effective NLG Evaluation Prompts</title>
    </titleInfo>
    <name type="personal">
        <namePart type="given">Hanhua</namePart>
        <namePart type="family">Hong</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Chenghao</namePart>
        <namePart type="family">Xiao</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Yang</namePart>
        <namePart type="family">Wang</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Yiqi</namePart>
        <namePart type="family">Liu</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Wenge</namePart>
        <namePart type="family">Rong</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Chenghua</namePart>
        <namePart type="family">Lin</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <originInfo>
        <dateIssued>2026</dateIssued>
    </originInfo>
    <typeOfResource>text</typeOfResource>
    <genre authority="bibutilsgt">journal article</genre>
    <relatedItem type="host">
        <titleInfo>
            <title>Transactions of the Association for Computational Linguistics</title>
        </titleInfo>
        <originInfo>
            <issuance>continuing</issuance>
            <publisher>MIT Press</publisher>
            <place>
                <placeTerm type="text">Cambridge, MA</placeTerm>
            </place>
        </originInfo>
        <genre authority="marcgt">periodical</genre>
        <genre authority="bibutilsgt">academic journal</genre>
    </relatedItem>
    <abstract>Evaluating natural language generation systems is challenging due to the diversity of valid outputs. While human evaluation is the gold standard, it suffers from inconsistencies, lack of standardization, and demographic biases, limiting reproducibility. LLM-based evaluators offer a scalable alternative but are highly sensitive to prompt design, where small variations can lead to significant discrepancies. In this work, we propose an inversion learning method that learns effective reverse mappings from model outputs back to their input instructions, enabling the automatic generation of highly effective, model-specific evaluation prompts. Our method requires only a single evaluation sample and eliminates the need for time-consuming manual prompt engineering, thereby improving both efficiency and robustness. Our work contributes toward a new direction for more robust and efficient LLM-based evaluation.</abstract>
    <identifier type="citekey">hong-etal-2026-beyond</identifier>
    <identifier type="doi">10.1162/tacl.a.617</identifier>
    <location>
        <url>https://aclanthology.org/2026.tacl-1.31/</url>
    </location>
    <part>
        <date>2026</date>
        <detail type="volume"><number>14</number></detail>
        <extent unit="page">
            <start>689</start>
            <end>710</end>
        </extent>
    </part>
</mods>
</modsCollection>

Download as File

%0 Journal Article
%T Beyond One-Size-Fits-All: Inversion Learning for Highly Effective NLG Evaluation Prompts
%A Hong, Hanhua
%A Xiao, Chenghao
%A Wang, Yang
%A Liu, Yiqi
%A Rong, Wenge
%A Lin, Chenghua
%J Transactions of the Association for Computational Linguistics
%D 2026
%V 14
%I MIT Press
%C Cambridge, MA
%F hong-etal-2026-beyond
%X Evaluating natural language generation systems is challenging due to the diversity of valid outputs. While human evaluation is the gold standard, it suffers from inconsistencies, lack of standardization, and demographic biases, limiting reproducibility. LLM-based evaluators offer a scalable alternative but are highly sensitive to prompt design, where small variations can lead to significant discrepancies. In this work, we propose an inversion learning method that learns effective reverse mappings from model outputs back to their input instructions, enabling the automatic generation of highly effective, model-specific evaluation prompts. Our method requires only a single evaluation sample and eliminates the need for time-consuming manual prompt engineering, thereby improving both efficiency and robustness. Our work contributes toward a new direction for more robust and efficient LLM-based evaluation.
%R 10.1162/tacl.a.617
%U https://aclanthology.org/2026.tacl-1.31/
%U https://doi.org/10.1162/tacl.a.617
%P 689-710

Download as File

Markdown (Informal)

[Beyond One-Size-Fits-All: Inversion Learning for Highly Effective NLG Evaluation Prompts](https://aclanthology.org/2026.tacl-1.31/) (Hong et al., TACL 2026)

Beyond One-Size-Fits-All: Inversion Learning for Highly Effective NLG Evaluation Prompts (Hong et al., TACL 2026)

ACL

Hanhua Hong, Chenghao Xiao, Yang Wang, Yiqi Liu, Wenge Rong, and Chenghua Lin. 2026. Beyond One-Size-Fits-All: Inversion Learning for Highly Effective NLG Evaluation Prompts. Transactions of the Association for Computational Linguistics, 14:689–710.