Anthology development and API

Information on how to programmatically access Anthology data


September 26, 2025

Software

The ACL Anthology is built on open-source software. The Anthology website is a static site generated with the Hugo framework; it makes heavy use of the Bootstrap library for a modern design and uses Font Awesome for its icon font.

Our source code and data are available on GitHub.

Data organization

All the data in the ACL Anthology is stored under the data directory in our GitHub repository. The xml directory holds the files that contain all the Anthology metadata, in the format described below. The yaml directory contains additional information relating to authors and venues.
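As a rough illustration of this layout, the following sketch lists the metadata files in a local checkout of the repository. The function name list_metadata_files and the checkout path are hypothetical, and the assumption that the files sit directly in data/xml and data/yaml with .xml and .yaml extensions is ours, not something this page guarantees.

```python
from pathlib import Path


def list_metadata_files(repo_root):
    """Return the sorted XML and YAML file names found under <repo_root>/data.

    Assumes (hypothetically) that files sit directly inside data/xml and
    data/yaml and use the .xml and .yaml extensions.
    """
    data_dir = Path(repo_root) / "data"
    return {
        "xml": sorted(p.name for p in (data_dir / "xml").glob("*.xml")),
        "yaml": sorted(p.name for p in (data_dir / "yaml").glob("*.yaml")),
    }


if __name__ == "__main__":
    # "acl-anthology" is a placeholder for wherever the repository was cloned.
    found = list_metadata_files("acl-anthology")
    print(len(found["xml"]), "XML files;", len(found["yaml"]), "YAML files")
```

If either directory is missing, the corresponding list is simply empty, which makes the helper safe to run against a partial checkout.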

Python API

We also provide a Python API that defines objects for papers, authors, volumes, and so on. It can be installed via pip from PyPI or built from source. For more information, please see our extensive developer documentation.

In addition to the documentation, there are many examples of using the module in the scripts in our bin directory. The create_hugo_yaml.py script, for example, demonstrates how we generate YAML data structures to build our static site.

Authoritative XML format

The Anthology site is generated from an authoritative XML file format containing information about volumes, paper titles, and authors. This data is stored in the official repository on GitHub. Here is a fragment of one XML file (P18.xml) to give you the idea; the full file contains much more information.

<?xml version="1.0" encoding="UTF-8" ?>
<collection id="P18">
  <volume id="3">
    <meta>
      <booktitle>Proceedings of <fixed-case>ACL</fixed-case> 2018, Student Research Workshop</booktitle>
      <editor><first>Vered</first><last>Shwartz</last></editor>
      <url>P18-3</url>
    </meta>
    <frontmatter>
      <url>P18-3000</url>
      <!-- ... -->
    </frontmatter>
    <paper id="1">
      <title>Towards Opinion Summarization of Customer Reviews</title>
      <author><first>Samuel</first><last>Pecar</last></author>
      <url>P18-3001</url>
      <!-- ... -->
    </paper>
    <paper id="2">
      <title>Sampling Informative Training Data for <fixed-case>RNN</fixed-case> Language Models</title>
      <author><first>Jared</first><last>Fernandez</last></author>
      <author><first>Doug</first><last>Downey</last></author>
      <url>P18-3002</url>
      <!-- ... -->
    </paper>
    <paper id="3">
      <title>Learning-based Composite Metrics for Improved Caption Evaluation</title>
      <author><first>Naeha</first><last>Sharif</last></author>
      <author><first>Lyndon</first><last>White</last></author>
      <author><first>Mohammed</first><last>Bennamoun</last></author>
      <author><first>Syed Afaq</first><last>Ali Shah</last></author>
      <url>P18-3003</url>
      <!-- ... -->
    </paper>
    <!-- ...  -->
  </volume>
</collection>

Our scripts use the lxml library (lxml.de) to parse the XML. You can see examples of parsing and accessing the XML directly in add_revision.py and ingest.py.
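For a quick feel of how the format can be read, here is a minimal sketch that walks an abbreviated copy of the fragment above and collects each paper's URL, title, and authors. To stay dependency-free it uses Python's standard xml.etree.ElementTree rather than lxml, which is a substitution on our part; the helper names plain_text and list_papers are hypothetical and do not come from the Anthology scripts.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the P18.xml fragment shown above (one paper only).
FRAGMENT = """<?xml version="1.0" encoding="UTF-8" ?>
<collection id="P18">
  <volume id="3">
    <meta>
      <booktitle>Proceedings of <fixed-case>ACL</fixed-case> 2018, Student Research Workshop</booktitle>
      <url>P18-3</url>
    </meta>
    <paper id="2">
      <title>Sampling Informative Training Data for <fixed-case>RNN</fixed-case> Language Models</title>
      <author><first>Jared</first><last>Fernandez</last></author>
      <author><first>Doug</first><last>Downey</last></author>
      <url>P18-3002</url>
    </paper>
  </volume>
</collection>
"""


def plain_text(element):
    """Flatten an element's text, dropping inline markup such as <fixed-case>."""
    return "".join(element.itertext()).strip()


def list_papers(xml_bytes):
    """Return one dict per <paper> with its url, plain-text title, and authors."""
    root = ET.fromstring(xml_bytes)
    papers = []
    for paper in root.iter("paper"):
        authors = []
        for author in paper.findall("author"):
            # Join whichever name parts are present; some may be absent.
            parts = (author.findtext("first"), author.findtext("last"))
            authors.append(" ".join(part for part in parts if part))
        papers.append(
            {
                "url": paper.findtext("url"),
                "title": plain_text(paper.find("title")),
                "authors": authors,
            }
        )
    return papers


if __name__ == "__main__":
    # Pass bytes so the encoding declaration in the XML header is honored.
    for entry in list_papers(FRAGMENT.encode("utf-8")):
        print(entry["url"], "|", entry["title"], "|", ", ".join(entry["authors"]))
```

Titles are flattened with itertext() because they can contain nested markup such as fixed-case, so reading only the element's direct text would truncate them at the first inline tag.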