SiPOS: A Benchmark Dataset for Sindhi Part-of-Speech Tagging

Wazir Ali; Zenglin Xu; Jay Kumar

Abstract

In this paper, we introduce the SiPOS dataset for part-of-speech tagging in the low-resource Sindhi language with quality baselines. The dataset consists of more than 293K tokens annotated with sixteen universal part-of-speech categories. Two experienced native annotators annotated the SiPOS using the Doccano text annotation tool with an inter-annotation agreement of 0.872. We exploit the conditional random field, the popular bidirectional long-short-term memory neural model, and self-attention mechanism with various settings to evaluate the proposed dataset. Besides pre-trained GloVe and fastText representation, the character-level representations are incorporated to extract character-level information using the bidirectional long-short-term memory encoder. The high accuracy of 96.25% is achieved with the task-specific joint word-level and character-level representations. The SiPOS dataset is likely to be a significant resource for the low-resource Sindhi language.
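The two figures quoted in the abstract, the annotator agreement of 0.872 and the tagging accuracy of 96.25%, can be illustrated with a minimal sketch. The abstract does not name the agreement statistic, so Cohen's kappa is used here purely as an assumption. The function names and the toy tag sequences are hypothetical and do not come from the SiPOS data.

```python
from collections import Counter


def cohen_kappa(tags_a, tags_b):
    """Cohen's kappa for two annotators labelling the same tokens.

    Assumption: the paper may use a different agreement statistic.
    """
    if len(tags_a) != len(tags_b) or not tags_a:
        raise ValueError("tag sequences must be non-empty and equally long")
    n = len(tags_a)
    # Observed agreement: share of tokens given the same tag by both annotators.
    observed = sum(a == b for a, b in zip(tags_a, tags_b)) / n
    # Chance agreement: product of each annotator's marginal tag frequencies.
    freq_a, freq_b = Counter(tags_a), Counter(tags_b)
    expected = sum(freq_a[t] * freq_b[t] for t in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical tag throughout
    return (observed - expected) / (1.0 - expected)


def token_accuracy(gold, predicted):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("tag sequences must be non-empty and equally long")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


# Toy example (hypothetical tags, not taken from SiPOS).
annotator_a = ["NOUN", "VERB", "NOUN", "ADJ"]
annotator_b = ["NOUN", "VERB", "ADJ", "ADJ"]
kappa = cohen_kappa(annotator_a, annotator_b)
accuracy = token_accuracy(annotator_a, annotator_b)
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is lower than the plain match rate on the same toy sequences.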

Anthology ID: 2021.ranlp-srw.4
Volume: Proceedings of the Student Research Workshop Associated with RANLP 2021
Month: September
Year: 2021
Address: Online
Editors: Souhila Djabri, Dinara Gimadi, Tsvetomila Mihaylova, Ivelina Nikolova-Koleva
Venue: RANLP
Publisher: INCOMA Ltd.
Pages: 22–30
URL: https://aclanthology.org/2021.ranlp-srw.4/
Bibkey: ali-etal-2021-sipos-benchmark
Cite (ACL): Wazir Ali, Zenglin Xu, and Jay Kumar. 2021. SiPOS: A Benchmark Dataset for Sindhi Part-of-Speech Tagging. In Proceedings of the Student Research Workshop Associated with RANLP 2021, pages 22–30, Online. INCOMA Ltd.
Cite (Informal): SiPOS: A Benchmark Dataset for Sindhi Part-of-Speech Tagging (Ali et al., RANLP 2021)
PDF: https://aclanthology.org/2021.ranlp-srw.4.pdf

Export citation

BibTeX

@inproceedings{ali-etal-2021-sipos-benchmark,
    title = "{S}i{POS}: A Benchmark Dataset for {S}indhi Part-of-Speech Tagging",
    author = "Ali, Wazir  and
      Xu, Zenglin  and
      Kumar, Jay",
    editor = "Djabri, Souhila  and
      Gimadi, Dinara  and
      Mihaylova, Tsvetomila  and
      Nikolova-Koleva, Ivelina",
    booktitle = "Proceedings of the Student Research Workshop Associated with RANLP 2021",
    month = sep,
    year = "2021",
    address = "Online",
    publisher = "INCOMA Ltd.",
    url = "https://aclanthology.org/2021.ranlp-srw.4/",
    pages = "22--30",
    abstract = "In this paper, we introduce the SiPOS dataset for part-of-speech tagging in the low-resource Sindhi language with quality baselines. The dataset consists of more than 293K tokens annotated with sixteen universal part-of-speech categories. Two experienced native annotators annotated the SiPOS using the Doccano text annotation tool with an inter-annotation agreement of 0.872. We exploit the conditional random field, the popular bidirectional long-short-term memory neural model, and self-attention mechanism with various settings to evaluate the proposed dataset. Besides pre-trained GloVe and fastText representation, the character-level representations are incorporated to extract character-level information using the bidirectional long-short-term memory encoder. The high accuracy of 96.25{\%} is achieved with the task-specific joint word-level and character-level representations. The SiPOS dataset is likely to be a significant resource for the low-resource Sindhi language."
}

MODS XML

<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="ali-etal-2021-sipos-benchmark">
    <titleInfo>
        <title>SiPOS: A Benchmark Dataset for Sindhi Part-of-Speech Tagging</title>
    </titleInfo>
    <name type="personal">
        <namePart type="given">Wazir</namePart>
        <namePart type="family">Ali</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Zenglin</namePart>
        <namePart type="family">Xu</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Jay</namePart>
        <namePart type="family">Kumar</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <originInfo>
        <dateIssued>2021-09</dateIssued>
    </originInfo>
    <typeOfResource>text</typeOfResource>
    <relatedItem type="host">
        <titleInfo>
            <title>Proceedings of the Student Research Workshop Associated with RANLP 2021</title>
        </titleInfo>
        <name type="personal">
            <namePart type="given">Souhila</namePart>
            <namePart type="family">Djabri</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Dinara</namePart>
            <namePart type="family">Gimadi</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Tsvetomila</namePart>
            <namePart type="family">Mihaylova</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Ivelina</namePart>
            <namePart type="family">Nikolova-Koleva</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <originInfo>
            <publisher>INCOMA Ltd.</publisher>
            <place>
                <placeTerm type="text">Online</placeTerm>
            </place>
        </originInfo>
        <genre authority="marcgt">conference publication</genre>
    </relatedItem>
    <abstract>In this paper, we introduce the SiPOS dataset for part-of-speech tagging in the low-resource Sindhi language with quality baselines. The dataset consists of more than 293K tokens annotated with sixteen universal part-of-speech categories. Two experienced native annotators annotated the SiPOS using the Doccano text annotation tool with an inter-annotation agreement of 0.872. We exploit the conditional random field, the popular bidirectional long-short-term memory neural model, and self-attention mechanism with various settings to evaluate the proposed dataset. Besides pre-trained GloVe and fastText representation, the character-level representations are incorporated to extract character-level information using the bidirectional long-short-term memory encoder. The high accuracy of 96.25% is achieved with the task-specific joint word-level and character-level representations. The SiPOS dataset is likely to be a significant resource for the low-resource Sindhi language.</abstract>
    <identifier type="citekey">ali-etal-2021-sipos-benchmark</identifier>
    <location>
        <url>https://aclanthology.org/2021.ranlp-srw.4/</url>
    </location>
    <part>
        <date>2021-09</date>
        <extent unit="page">
            <start>22</start>
            <end>30</end>
        </extent>
    </part>
</mods>
</modsCollection>

Endnote

%0 Conference Proceedings
%T SiPOS: A Benchmark Dataset for Sindhi Part-of-Speech Tagging
%A Ali, Wazir
%A Xu, Zenglin
%A Kumar, Jay
%Y Djabri, Souhila
%Y Gimadi, Dinara
%Y Mihaylova, Tsvetomila
%Y Nikolova-Koleva, Ivelina
%S Proceedings of the Student Research Workshop Associated with RANLP 2021
%D 2021
%8 September
%I INCOMA Ltd.
%C Online
%F ali-etal-2021-sipos-benchmark
%X In this paper, we introduce the SiPOS dataset for part-of-speech tagging in the low-resource Sindhi language with quality baselines. The dataset consists of more than 293K tokens annotated with sixteen universal part-of-speech categories. Two experienced native annotators annotated the SiPOS using the Doccano text annotation tool with an inter-annotation agreement of 0.872. We exploit the conditional random field, the popular bidirectional long-short-term memory neural model, and self-attention mechanism with various settings to evaluate the proposed dataset. Besides pre-trained GloVe and fastText representation, the character-level representations are incorporated to extract character-level information using the bidirectional long-short-term memory encoder. The high accuracy of 96.25% is achieved with the task-specific joint word-level and character-level representations. The SiPOS dataset is likely to be a significant resource for the low-resource Sindhi language.
%U https://aclanthology.org/2021.ranlp-srw.4/
%P 22-30

Markdown (Informal)

[SiPOS: A Benchmark Dataset for Sindhi Part-of-Speech Tagging](https://aclanthology.org/2021.ranlp-srw.4/) (Ali et al., RANLP 2021)