Hemolix.TabGen: Optimized Table Generation from Documents

Gyanendra Shrestha; Todor Ivanov; Karthik Vemireddy; Anna Pyayt; Michael Gubanov

Abstract

Modern data lakes contain vast and heterogeneous document collections, making table generation from documents a persistent and nontrivial challenge. Traditional approaches are often rigid: they are domain-specific, require extensive supervision, or are limited to a set of pre-defined schemas. LLM-based approaches are more flexible, but typically suffer from hallucinations, non-determinism, and high computational costs. To overcome these limitations, we introduce Hemolix.TabGen, a novel, scalable LLM-based table generation system that comprehends documents and generates bi-dimensional tables based on the entire document content. We evaluated TabGen on 4 publicly available datasets spanning multiple domains and observed an Average Precision delta of up to 30% compared to vanilla LLMs.

Anthology ID: 2026.acl-industry.73
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)
Month: July
Year: 2026
Address: San Diego, California, USA
Editors: Yunyao Li, Georg Rehm, Mei Tu
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 1055–1066
URL: https://aclanthology.org/2026.acl-industry.73/
Cite (ACL): Gyanendra Shrestha, Todor Ivanov, Karthik Vemireddy, Anna Pyayt, and Michael Gubanov. 2026. Hemolix.TabGen: Optimized Table Generation from Documents. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026), pages 1055–1066, San Diego, California, USA. Association for Computational Linguistics.
Cite (Informal): Hemolix.TabGen: Optimized Table Generation from Documents (Shrestha et al., ACL 2026)
PDF: https://aclanthology.org/2026.acl-industry.73.pdf

Export citation

BibTeX

@inproceedings{shrestha-etal-2026-hemolix,
    title = "{H}emolix.{T}ab{G}en: Optimized Table Generation from Documents",
    author = "Shrestha, Gyanendra  and
      Ivanov, Todor  and
      Vemireddy, Karthik  and
      Pyayt, Anna  and
      Gubanov, Michael",
    editor = "Li, Yunyao  and
      Rehm, Georg  and
      Tu, Mei",
    booktitle = "Proceedings of the 64th Annual Meeting of the {A}ssociation for {C}omputational {L}inguistics ({ACL} 2026)",
    month = jul,
    year = "2026",
    address = "San Diego, California, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2026.acl-industry.73/",
    pages = "1055--1066",
    ISBN = "979-8-89176-394-4",
    abstract = "Modern data lakes contain vast and heterogeneous document collections, making table generation from documents a persistent and nontrivial challenge. Traditional approaches are often rigid: they are domain-specific, require extensive supervision, or are limited to a set of pre-defined schemas. LLM-based approaches are more flexible, but typically suffer from hallucinations, non-determinism, and high computational costs. To overcome these limitations, we introduce Hemolix.TabGen, a novel, scalable LLM-based table generation system that comprehends documents and generates bi-dimensional tables based on the entire document content. We evaluated TabGen on 4 publicly available datasets spanning multiple domains and observed an Average Precision delta of up to 30{\%} compared to vanilla LLMs."
}

MODS XML

<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="shrestha-etal-2026-hemolix">
    <titleInfo>
        <title>Hemolix.TabGen: Optimized Table Generation from Documents</title>
    </titleInfo>
    <name type="personal">
        <namePart type="given">Gyanendra</namePart>
        <namePart type="family">Shrestha</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Todor</namePart>
        <namePart type="family">Ivanov</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Karthik</namePart>
        <namePart type="family">Vemireddy</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Anna</namePart>
        <namePart type="family">Pyayt</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Michael</namePart>
        <namePart type="family">Gubanov</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <originInfo>
        <dateIssued>2026-07</dateIssued>
    </originInfo>
    <typeOfResource>text</typeOfResource>
    <relatedItem type="host">
        <titleInfo>
            <title>Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)</title>
        </titleInfo>
        <name type="personal">
            <namePart type="given">Yunyao</namePart>
            <namePart type="family">Li</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Georg</namePart>
            <namePart type="family">Rehm</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Mei</namePart>
            <namePart type="family">Tu</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <originInfo>
            <publisher>Association for Computational Linguistics</publisher>
            <place>
                <placeTerm type="text">San Diego, California, USA</placeTerm>
            </place>
        </originInfo>
        <genre authority="marcgt">conference publication</genre>
        <identifier type="isbn">979-8-89176-394-4</identifier>
    </relatedItem>
    <abstract>Modern data lakes contain vast and heterogeneous document collections, making table generation from documents a persistent and nontrivial challenge. Traditional approaches are often rigid: they are domain-specific, require extensive supervision, or are limited to a set of pre-defined schemas. LLM-based approaches are more flexible, but typically suffer from hallucinations, non-determinism, and high computational costs. To overcome these limitations, we introduce Hemolix.TabGen, a novel, scalable LLM-based table generation system that comprehends documents and generates bi-dimensional tables based on the entire document content. We evaluated TabGen on 4 publicly available datasets spanning multiple domains and observed an Average Precision delta of up to 30% compared to vanilla LLMs.</abstract>
    <identifier type="citekey">shrestha-etal-2026-hemolix</identifier>
    <location>
        <url>https://aclanthology.org/2026.acl-industry.73/</url>
    </location>
    <part>
        <date>2026-07</date>
        <extent unit="page">
            <start>1055</start>
            <end>1066</end>
        </extent>
    </part>
</mods>
</modsCollection>

Endnote

%0 Conference Proceedings
%T Hemolix.TabGen: Optimized Table Generation from Documents
%A Shrestha, Gyanendra
%A Ivanov, Todor
%A Vemireddy, Karthik
%A Pyayt, Anna
%A Gubanov, Michael
%Y Li, Yunyao
%Y Rehm, Georg
%Y Tu, Mei
%S Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)
%D 2026
%8 July
%I Association for Computational Linguistics
%C San Diego, California, USA
%@ 979-8-89176-394-4
%F shrestha-etal-2026-hemolix
%X Modern data lakes contain vast and heterogeneous document collections, making table generation from documents a persistent and nontrivial challenge. Traditional approaches are often rigid: they are domain-specific, require extensive supervision, or are limited to a set of pre-defined schemas. LLM-based approaches are more flexible, but typically suffer from hallucinations, non-determinism, and high computational costs. To overcome these limitations, we introduce Hemolix.TabGen, a novel, scalable LLM-based table generation system that comprehends documents and generates bi-dimensional tables based on the entire document content. We evaluated TabGen on 4 publicly available datasets spanning multiple domains and observed an Average Precision delta of up to 30% compared to vanilla LLMs.
%U https://aclanthology.org/2026.acl-industry.73/
%P 1055-1066

Markdown (Informal)

[Hemolix.TabGen: Optimized Table Generation from Documents](https://aclanthology.org/2026.acl-industry.73/) (Shrestha et al., ACL 2026)

ACL

Gyanendra Shrestha, Todor Ivanov, Karthik Vemireddy, Anna Pyayt, and Michael Gubanov. 2026. Hemolix.TabGen: Optimized Table Generation from Documents. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026), pages 1055–1066, San Diego, California, USA. Association for Computational Linguistics.