Two-stage Pipeline for Multilingual Dialect Detection

Ankit Vaidya; Aditya Kane

doi:10.18653/v1/2023.vardial-1.22

Abstract

Dialect Identification is a crucial task for localizing various Large Language Models. This paper outlines our approach to the VarDial 2023 shared task. Here we have to identify three or two dialects from three languages each which results in a 9-way classification for Track-1 and 6-way classification for Track-2 respectively. Our proposed approach consists of a two-stage system and outperforms other participants’ systems and previous works in this domain. We achieve a score of 58.54% for Track-1 and 85.61% for Track-2. Our codebase is available publicly (https://github.com/ankit-vaidya19/EACL_VarDial2023).
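The abstract does not spell out what the two stages are. As a minimal sketch, assuming the first stage predicts the language and the second stage picks a dialect using a classifier specific to that language, the control flow might look like the following. The label names, the keyword-based toy classifiers, and every function name here are hypothetical placeholders, not the authors' models or the shared task's label set. The label table has the Track-1 shape from the abstract (three languages with three dialects each, so 9 classes); with two dialects per language the same structure would give the 6 classes of Track-2.

```python
from typing import Callable, Dict, List

# Hypothetical label inventory with the Track-1 shape: 3 languages x 3 dialects.
# The names are placeholders, not the shared task's actual labels.
DIALECTS: Dict[str, List[str]] = {
    "lang_a": ["a1", "a2", "a3"],
    "lang_b": ["b1", "b2", "b3"],
    "lang_c": ["c1", "c2", "c3"],
}

Classifier = Callable[[str], str]


def make_keyword_classifier(keywords: Dict[str, str], default: str) -> Classifier:
    """Toy stand-in for a trained model: label of the first known token, else default."""
    def classify(text: str) -> str:
        for token in text.lower().split():
            if token in keywords:
                return keywords[token]
        return default
    return classify


def two_stage_predict(
    text: str,
    language_clf: Classifier,
    dialect_clfs: Dict[str, Classifier],
) -> str:
    """Stage 1 chooses the language; stage 2 chooses a dialect within that language."""
    language = language_clf(text)
    return dialect_clfs[language](text)


# Assumed stage 1: a single classifier over the three languages.
stage1 = make_keyword_classifier(
    {"alpha": "lang_a", "beta": "lang_b", "gamma": "lang_c"}, default="lang_a"
)

# Assumed stage 2: one classifier per language, restricted to that language's dialects.
stage2 = {
    lang: make_keyword_classifier(
        {"uno": labels[0], "dos": labels[1], "tres": labels[2]}, default=labels[0]
    )
    for lang, labels in DIALECTS.items()
}

# Size of the flat label space that the pipeline covers.
num_classes = sum(len(labels) for labels in DIALECTS.values())
```

Calling `two_stage_predict("beta dos", stage1, stage2)` routes the text through the `lang_b` classifier only, so the second stage never has to separate dialects of different languages.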

Anthology ID: 2023.vardial-1.22
Volume: Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023)
Month: May
Year: 2023
Address: Dubrovnik, Croatia
Editors: Yves Scherrer, Tommi Jauhiainen, Nikola Ljubešić, Preslav Nakov, Jörg Tiedemann, Marcos Zampieri
Venue: VarDial
Publisher: Association for Computational Linguistics
Pages: 222–229
URL: https://aclanthology.org/2023.vardial-1.22/
DOI: 10.18653/v1/2023.vardial-1.22
Bibkey: vaidya-kane-2023-two
Cite (ACL): Ankit Vaidya and Aditya Kane. 2023. Two-stage Pipeline for Multilingual Dialect Detection. In Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023), pages 222–229, Dubrovnik, Croatia. Association for Computational Linguistics.
Cite (Informal): Two-stage Pipeline for Multilingual Dialect Detection (Vaidya & Kane, VarDial 2023)
PDF: https://aclanthology.org/2023.vardial-1.22.pdf
Video: https://aclanthology.org/2023.vardial-1.22.mp4

Export citation

BibTeX

@inproceedings{vaidya-kane-2023-two,
    title = "Two-stage Pipeline for Multilingual Dialect Detection",
    author = "Vaidya, Ankit  and
      Kane, Aditya",
    editor = {Scherrer, Yves  and
      Jauhiainen, Tommi  and
      Ljube{\v{s}}i{\'c}, Nikola  and
      Nakov, Preslav  and
      Tiedemann, J{\"o}rg  and
      Zampieri, Marcos},
    booktitle = "Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023)",
    month = may,
    year = "2023",
    address = "Dubrovnik, Croatia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.vardial-1.22/",
    doi = "10.18653/v1/2023.vardial-1.22",
    pages = "222--229",
    abstract = "Dialect Identification is a crucial task for localizing various Large Language Models. This paper outlines our approach to the VarDial 2023 shared task. Here we have to identify three or two dialects from three languages each which results in a 9-way classification for Track-1 and 6-way classification for Track-2 respectively. Our proposed approach consists of a two-stage system and outperforms other participants' systems and previous works in this domain. We achieve a score of 58.54{\%} for Track-1 and 85.61{\%} for Track-2. Our codebase is available publicly (\url{https://github.com/ankit-vaidya19/EACL_VarDial2023})."
}

MODS XML

<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="vaidya-kane-2023-two">
    <titleInfo>
        <title>Two-stage Pipeline for Multilingual Dialect Detection</title>
    </titleInfo>
    <name type="personal">
        <namePart type="given">Ankit</namePart>
        <namePart type="family">Vaidya</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Aditya</namePart>
        <namePart type="family">Kane</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <originInfo>
        <dateIssued>2023-05</dateIssued>
    </originInfo>
    <typeOfResource>text</typeOfResource>
    <relatedItem type="host">
        <titleInfo>
            <title>Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023)</title>
        </titleInfo>
        <name type="personal">
            <namePart type="given">Yves</namePart>
            <namePart type="family">Scherrer</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Tommi</namePart>
            <namePart type="family">Jauhiainen</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Nikola</namePart>
            <namePart type="family">Ljubešić</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Preslav</namePart>
            <namePart type="family">Nakov</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Jörg</namePart>
            <namePart type="family">Tiedemann</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Marcos</namePart>
            <namePart type="family">Zampieri</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <originInfo>
            <publisher>Association for Computational Linguistics</publisher>
            <place>
                <placeTerm type="text">Dubrovnik, Croatia</placeTerm>
            </place>
        </originInfo>
        <genre authority="marcgt">conference publication</genre>
    </relatedItem>
    <abstract>Dialect Identification is a crucial task for localizing various Large Language Models. This paper outlines our approach to the VarDial 2023 shared task. Here we have to identify three or two dialects from three languages each which results in a 9-way classification for Track-1 and 6-way classification for Track-2 respectively. Our proposed approach consists of a two-stage system and outperforms other participants’ systems and previous works in this domain. We achieve a score of 58.54% for Track-1 and 85.61% for Track-2. Our codebase is available publicly (https://github.com/ankit-vaidya19/EACL_VarDial2023).</abstract>
    <identifier type="citekey">vaidya-kane-2023-two</identifier>
    <identifier type="doi">10.18653/v1/2023.vardial-1.22</identifier>
    <location>
        <url>https://aclanthology.org/2023.vardial-1.22/</url>
    </location>
    <part>
        <date>2023-05</date>
        <extent unit="page">
            <start>222</start>
            <end>229</end>
        </extent>
    </part>
</mods>
</modsCollection>

Endnote

%0 Conference Proceedings
%T Two-stage Pipeline for Multilingual Dialect Detection
%A Vaidya, Ankit
%A Kane, Aditya
%Y Scherrer, Yves
%Y Jauhiainen, Tommi
%Y Ljubešić, Nikola
%Y Nakov, Preslav
%Y Tiedemann, Jörg
%Y Zampieri, Marcos
%S Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023)
%D 2023
%8 May
%I Association for Computational Linguistics
%C Dubrovnik, Croatia
%F vaidya-kane-2023-two
%X Dialect Identification is a crucial task for localizing various Large Language Models. This paper outlines our approach to the VarDial 2023 shared task. Here we have to identify three or two dialects from three languages each which results in a 9-way classification for Track-1 and 6-way classification for Track-2 respectively. Our proposed approach consists of a two-stage system and outperforms other participants’ systems and previous works in this domain. We achieve a score of 58.54% for Track-1 and 85.61% for Track-2. Our codebase is available publicly (https://github.com/ankit-vaidya19/EACL_VarDial2023).
%R 10.18653/v1/2023.vardial-1.22
%U https://aclanthology.org/2023.vardial-1.22/
%U https://doi.org/10.18653/v1/2023.vardial-1.22
%P 222-229

Markdown (Informal)

[Two-stage Pipeline for Multilingual Dialect Detection](https://aclanthology.org/2023.vardial-1.22/) (Vaidya & Kane, VarDial 2023)
