Mawdoo3 AI at MADAR Shared Task: Arabic Tweet Dialect Identification

Bashar Talafha; Wael Farhan; Ahmed Altakrouri; Hussein Al-Natsheh

doi:10.18653/v1/W19-4629

Abstract

Arabic dialect identification is an inherently complex problem, as Arabic dialect taxonomy is convoluted and aims to dissect a continuous space rather than a discrete one. In this work, we present machine and deep learning approaches to predict 21 fine-grained dialects from a set of given tweets per user. We adopted numerous feature extraction methods most of which showed improvement in the final model, such as word embedding, Tf-idf, and other tweet features. Our results show that a simple LinearSVC can outperform any complex deep learning model given a set of curated features. With a relatively complex user voting mechanism, we were able to achieve a Macro-Averaged F1-score of 71.84% on MADAR shared subtask-2. Our best submitted model ranked second out of all participating teams.
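
The abstract mentions aggregating tweet-level predictions into one label per user and scoring with macro-averaged F1. The following is a minimal sketch of that idea, not the authors' method: it assumes a plain majority vote (the paper describes its voting mechanism as relatively complex), and the function names, user IDs, and labels are hypothetical values chosen for illustration.

```python
from collections import Counter, defaultdict


def vote_per_user(tweet_predictions):
    """Aggregate per-tweet labels into one label per user by majority vote.

    tweet_predictions: iterable of (user_id, predicted_label) pairs.
    Assumption: ties go to the label seen first for that user.
    """
    by_user = defaultdict(list)
    for user_id, label in tweet_predictions:
        by_user[user_id].append(label)
    # Counter.most_common keeps first-seen order among equal counts (Python 3.7+).
    return {user: Counter(labels).most_common(1)[0][0]
            for user, labels in by_user.items()}


def macro_f1(gold, predicted):
    """Unweighted mean of per-class F1 over classes in gold or predicted."""
    labels = sorted(set(gold) | set(predicted))
    scores = []
    for label in labels:
        pairs = list(zip(gold, predicted))
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical tweet-level predictions for three users.
tweet_predictions = [
    ("u1", "Jordan"), ("u1", "Jordan"), ("u1", "Egypt"),
    ("u2", "Egypt"), ("u2", "Iraq"), ("u2", "Egypt"),
    ("u3", "Iraq"),
]
# Hypothetical gold labels per user.
gold_by_user = {"u1": "Jordan", "u2": "Egypt", "u3": "Egypt"}

user_labels = vote_per_user(tweet_predictions)
users = sorted(gold_by_user)
score = macro_f1([gold_by_user[u] for u in users],
                 [user_labels[u] for u in users])
print(user_labels, round(score, 4))
```

Macro averaging weights every class equally, so a dialect with few users counts as much as a frequent one; this is why a single misclassified user in a small class can move the score noticeably.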

Anthology ID: W19-4629
Volume: Proceedings of the Fourth Arabic Natural Language Processing Workshop
Month: August
Year: 2019
Address: Florence, Italy
Editors: Wassim El-Hajj, Lamia Hadrich Belguith, Fethi Bougares, Walid Magdy, Imed Zitouni, Nadi Tomeh, Mahmoud El-Haj, Wajdi Zaghouani
Venue: WANLP
SIG: SIGARAB
Publisher: Association for Computational Linguistics
Pages: 239–243
URL: https://aclanthology.org/W19-4629/
DOI: 10.18653/v1/W19-4629
Cite (ACL): Bashar Talafha, Wael Farhan, Ahmed Altakrouri, and Hussein Al-Natsheh. 2019. Mawdoo3 AI at MADAR Shared Task: Arabic Tweet Dialect Identification. In Proceedings of the Fourth Arabic Natural Language Processing Workshop, pages 239–243, Florence, Italy. Association for Computational Linguistics.
Cite (Informal): Mawdoo3 AI at MADAR Shared Task: Arabic Tweet Dialect Identification (Talafha et al., WANLP 2019)
PDF: https://aclanthology.org/W19-4629.pdf

Export citation

@inproceedings{talafha-etal-2019-mawdoo3,
    title = "Mawdoo3 {AI} at {MADAR} Shared Task: {A}rabic Tweet Dialect Identification",
    author = "Talafha, Bashar  and
      Farhan, Wael  and
      Altakrouri, Ahmed  and
      Al-Natsheh, Hussein",
    editor = "El-Hajj, Wassim  and
      Belguith, Lamia Hadrich  and
      Bougares, Fethi  and
      Magdy, Walid  and
      Zitouni, Imed  and
      Tomeh, Nadi  and
      El-Haj, Mahmoud  and
      Zaghouani, Wajdi",
    booktitle = "Proceedings of the Fourth Arabic Natural Language Processing Workshop",
    month = aug,
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/W19-4629/",
    doi = "10.18653/v1/W19-4629",
    pages = "239--243",
    abstract = "Arabic dialect identification is an inherently complex problem, as Arabic dialect taxonomy is convoluted and aims to dissect a continuous space rather than a discrete one. In this work, we present machine and deep learning approaches to predict 21 fine-grained dialects from a set of given tweets per user. We adopted numerous feature extraction methods most of which showed improvement in the final model, such as word embedding, Tf-idf, and other tweet features. Our results show that a simple LinearSVC can outperform any complex deep learning model given a set of curated features. With a relatively complex user voting mechanism, we were able to achieve a Macro-Averaged F1-score of 71.84{\%} on MADAR shared subtask-2. Our best submitted model ranked second out of all participating teams."
}

<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="talafha-etal-2019-mawdoo3">
    <titleInfo>
        <title>Mawdoo3 AI at MADAR Shared Task: Arabic Tweet Dialect Identification</title>
    </titleInfo>
    <name type="personal">
        <namePart type="given">Bashar</namePart>
        <namePart type="family">Talafha</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Wael</namePart>
        <namePart type="family">Farhan</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Ahmed</namePart>
        <namePart type="family">Altakrouri</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Hussein</namePart>
        <namePart type="family">Al-Natsheh</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <originInfo>
        <dateIssued>2019-08</dateIssued>
    </originInfo>
    <typeOfResource>text</typeOfResource>
    <relatedItem type="host">
        <titleInfo>
            <title>Proceedings of the Fourth Arabic Natural Language Processing Workshop</title>
        </titleInfo>
        <name type="personal">
            <namePart type="given">Wassim</namePart>
            <namePart type="family">El-Hajj</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Lamia</namePart>
            <namePart type="given">Hadrich</namePart>
            <namePart type="family">Belguith</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Fethi</namePart>
            <namePart type="family">Bougares</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Walid</namePart>
            <namePart type="family">Magdy</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Imed</namePart>
            <namePart type="family">Zitouni</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Nadi</namePart>
            <namePart type="family">Tomeh</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Mahmoud</namePart>
            <namePart type="family">El-Haj</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Wajdi</namePart>
            <namePart type="family">Zaghouani</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <originInfo>
            <publisher>Association for Computational Linguistics</publisher>
            <place>
                <placeTerm type="text">Florence, Italy</placeTerm>
            </place>
        </originInfo>
        <genre authority="marcgt">conference publication</genre>
    </relatedItem>
    <abstract>Arabic dialect identification is an inherently complex problem, as Arabic dialect taxonomy is convoluted and aims to dissect a continuous space rather than a discrete one. In this work, we present machine and deep learning approaches to predict 21 fine-grained dialects from a set of given tweets per user. We adopted numerous feature extraction methods most of which showed improvement in the final model, such as word embedding, Tf-idf, and other tweet features. Our results show that a simple LinearSVC can outperform any complex deep learning model given a set of curated features. With a relatively complex user voting mechanism, we were able to achieve a Macro-Averaged F1-score of 71.84% on MADAR shared subtask-2. Our best submitted model ranked second out of all participating teams.</abstract>
    <identifier type="citekey">talafha-etal-2019-mawdoo3</identifier>
    <identifier type="doi">10.18653/v1/W19-4629</identifier>
    <location>
        <url>https://aclanthology.org/W19-4629/</url>
    </location>
    <part>
        <date>2019-08</date>
        <extent unit="page">
            <start>239</start>
            <end>243</end>
        </extent>
    </part>
</mods>
</modsCollection>

%0 Conference Proceedings
%T Mawdoo3 AI at MADAR Shared Task: Arabic Tweet Dialect Identification
%A Talafha, Bashar
%A Farhan, Wael
%A Altakrouri, Ahmed
%A Al-Natsheh, Hussein
%Y El-Hajj, Wassim
%Y Belguith, Lamia Hadrich
%Y Bougares, Fethi
%Y Magdy, Walid
%Y Zitouni, Imed
%Y Tomeh, Nadi
%Y El-Haj, Mahmoud
%Y Zaghouani, Wajdi
%S Proceedings of the Fourth Arabic Natural Language Processing Workshop
%D 2019
%8 August
%I Association for Computational Linguistics
%C Florence, Italy
%F talafha-etal-2019-mawdoo3
%X Arabic dialect identification is an inherently complex problem, as Arabic dialect taxonomy is convoluted and aims to dissect a continuous space rather than a discrete one. In this work, we present machine and deep learning approaches to predict 21 fine-grained dialects from a set of given tweets per user. We adopted numerous feature extraction methods most of which showed improvement in the final model, such as word embedding, Tf-idf, and other tweet features. Our results show that a simple LinearSVC can outperform any complex deep learning model given a set of curated features. With a relatively complex user voting mechanism, we were able to achieve a Macro-Averaged F1-score of 71.84% on MADAR shared subtask-2. Our best submitted model ranked second out of all participating teams.
%R 10.18653/v1/W19-4629
%U https://aclanthology.org/W19-4629/
%U https://doi.org/10.18653/v1/W19-4629
%P 239-243

Markdown (Informal)

[Mawdoo3 AI at MADAR Shared Task: Arabic Tweet Dialect Identification](https://aclanthology.org/W19-4629/) (Talafha et al., WANLP 2019)