Teanga Data Model for Linked Corpora

John Philip McCrae; Priya Rani; Adrian Doyle; Bernardo Stearns

Abstract

Corpus data is the main source of data for natural language processing applications; however, no standard or model for corpus data has become predominant in the field. Linguistic linked data aims to provide methods by which data can be made findable, accessible, interoperable and reusable (FAIR). However, current attempts to create a linked data format for corpora have been unsuccessful due to the verbose and specialised formats that they use. In this work, we present the Teanga data model, which uses a layered annotation model to capture all NLP-relevant annotations. We present the YAML serialization of the model, which is concise and uses a widely deployed format, and we describe how it can be interpreted as RDF. Finally, we demonstrate three examples of the use of the Teanga data model: syntactic annotation, literary analysis and multilingual corpora.

Anthology ID: 2024.ldl-1.9
Volume: Proceedings of the 9th Workshop on Linked Data in Linguistics @ LREC-COLING 2024
Month: May
Year: 2024
Address: Torino, Italia
Editors: Christian Chiarcos, Katerina Gkirtzou, Maxim Ionov, Fahad Khan, John P. McCrae, Elena Montiel Ponsoda, Patricia Martín Chozas
Venues: LDL | WS
Publisher: ELRA and ICCL
Pages: 66–74
URL: https://aclanthology.org/2024.ldl-1.9/
Cite (ACL): John P. McCrae, Priya Rani, Adrian Doyle, and Bernardo Stearns. 2024. Teanga Data Model for Linked Corpora. In Proceedings of the 9th Workshop on Linked Data in Linguistics @ LREC-COLING 2024, pages 66–74, Torino, Italia. ELRA and ICCL.
Cite (Informal): Teanga Data Model for Linked Corpora (McCrae et al., LDL 2024)
PDF: https://aclanthology.org/2024.ldl-1.9.pdf

Export citation

BibTeX

@inproceedings{mccrae-etal-2024-teanga,
    title = "Teanga Data Model for Linked Corpora",
    author = "McCrae, John P.  and
      Rani, Priya  and
      Doyle, Adrian  and
      Stearns, Bernardo",
    editor = "Chiarcos, Christian  and
      Gkirtzou, Katerina  and
      Ionov, Maxim  and
      Khan, Fahad  and
      McCrae, John P.  and
      Ponsoda, Elena Montiel  and
      Chozas, Patricia Mart{\'i}n",
    booktitle = "Proceedings of the 9th Workshop on Linked Data in Linguistics @ LREC-COLING 2024",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.ldl-1.9/",
    pages = "66--74",
    abstract = "Corpus data is the main source of data for natural language processing applications, however no standard or model for corpus data has become predominant in the field. Linguistic linked data aims to provide methods by which data can be made findable, accessible, interoperable and reusable (FAIR). However, current attempts to create a linked data format for corpora have been unsuccessful due to the verbose and specialised formats that they use. In this work, we present the Teanga data model, which uses a layered annotation model to capture all NLP-relevant annotations. We present the YAML serializations of the model, which is concise and uses a widely-deployed format, and we describe how this can be interpreted as RDF. Finally, we demonstrate three examples of the use of the Teanga data model for syntactic annotation, literary analysis and multilingual corpora."
}

MODS XML

<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="mccrae-etal-2024-teanga">
    <titleInfo>
        <title>Teanga Data Model for Linked Corpora</title>
    </titleInfo>
    <name type="personal">
        <namePart type="given">John</namePart>
        <namePart type="given">P</namePart>
        <namePart type="family">McCrae</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Priya</namePart>
        <namePart type="family">Rani</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Adrian</namePart>
        <namePart type="family">Doyle</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Bernardo</namePart>
        <namePart type="family">Stearns</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <originInfo>
        <dateIssued>2024-05</dateIssued>
    </originInfo>
    <typeOfResource>text</typeOfResource>
    <relatedItem type="host">
        <titleInfo>
            <title>Proceedings of the 9th Workshop on Linked Data in Linguistics @ LREC-COLING 2024</title>
        </titleInfo>
        <name type="personal">
            <namePart type="given">Christian</namePart>
            <namePart type="family">Chiarcos</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Katerina</namePart>
            <namePart type="family">Gkirtzou</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Maxim</namePart>
            <namePart type="family">Ionov</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Fahad</namePart>
            <namePart type="family">Khan</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">John</namePart>
            <namePart type="given">P</namePart>
            <namePart type="family">McCrae</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Elena</namePart>
            <namePart type="given">Montiel</namePart>
            <namePart type="family">Ponsoda</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Patricia</namePart>
            <namePart type="given">Martín</namePart>
            <namePart type="family">Chozas</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <originInfo>
            <publisher>ELRA and ICCL</publisher>
            <place>
                <placeTerm type="text">Torino, Italia</placeTerm>
            </place>
        </originInfo>
        <genre authority="marcgt">conference publication</genre>
    </relatedItem>
    <abstract>Corpus data is the main source of data for natural language processing applications, however no standard or model for corpus data has become predominant in the field. Linguistic linked data aims to provide methods by which data can be made findable, accessible, interoperable and reusable (FAIR). However, current attempts to create a linked data format for corpora have been unsuccessful due to the verbose and specialised formats that they use. In this work, we present the Teanga data model, which uses a layered annotation model to capture all NLP-relevant annotations. We present the YAML serializations of the model, which is concise and uses a widely-deployed format, and we describe how this can be interpreted as RDF. Finally, we demonstrate three examples of the use of the Teanga data model for syntactic annotation, literary analysis and multilingual corpora.</abstract>
    <identifier type="citekey">mccrae-etal-2024-teanga</identifier>
    <location>
        <url>https://aclanthology.org/2024.ldl-1.9/</url>
    </location>
    <part>
        <date>2024-05</date>
        <extent unit="page">
            <start>66</start>
            <end>74</end>
        </extent>
    </part>
</mods>
</modsCollection>

Endnote

%0 Conference Proceedings
%T Teanga Data Model for Linked Corpora
%A McCrae, John P.
%A Rani, Priya
%A Doyle, Adrian
%A Stearns, Bernardo
%Y Chiarcos, Christian
%Y Gkirtzou, Katerina
%Y Ionov, Maxim
%Y Khan, Fahad
%Y McCrae, John P.
%Y Ponsoda, Elena Montiel
%Y Chozas, Patricia Martín
%S Proceedings of the 9th Workshop on Linked Data in Linguistics @ LREC-COLING 2024
%D 2024
%8 May
%I ELRA and ICCL
%C Torino, Italia
%F mccrae-etal-2024-teanga
%X Corpus data is the main source of data for natural language processing applications, however no standard or model for corpus data has become predominant in the field. Linguistic linked data aims to provide methods by which data can be made findable, accessible, interoperable and reusable (FAIR). However, current attempts to create a linked data format for corpora have been unsuccessful due to the verbose and specialised formats that they use. In this work, we present the Teanga data model, which uses a layered annotation model to capture all NLP-relevant annotations. We present the YAML serializations of the model, which is concise and uses a widely-deployed format, and we describe how this can be interpreted as RDF. Finally, we demonstrate three examples of the use of the Teanga data model for syntactic annotation, literary analysis and multilingual corpora.
%U https://aclanthology.org/2024.ldl-1.9/
%P 66-74

Markdown (Informal)

[Teanga Data Model for Linked Corpora](https://aclanthology.org/2024.ldl-1.9/) (McCrae et al., LDL 2024)
