Multi-Grained Chinese Word Segmentation

Chen Gong; Zhenghua Li (李正华); Min Zhang; Xinzhou Jiang

doi:10.18653/v1/D17-1072

Abstract

Traditionally, word segmentation (WS) adopts the single-grained formalism, where a sentence corresponds to a single word sequence. However, Sproat et al. (1997) show that the inter-native-speaker consistency ratio over Chinese word boundaries is only 76%, indicating single-grained WS (SWS) imposes unnecessary challenges on both manual annotation and statistical modeling. Moreover, WS results of different granularities can be complementary and beneficial for high-level applications. This work proposes and addresses multi-grained WS (MWS). We build a large-scale pseudo MWS dataset for model training and tuning by leveraging the annotation heterogeneity of three SWS datasets. Then we manually annotate 1,500 test sentences with true MWS annotations. Finally, we propose three benchmark approaches by casting MWS as constituent parsing and sequence labeling. Experiments and analysis lead to many interesting findings.
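As a rough illustration of the idea in the abstract, the sketch below pools several single-grained segmentations of one sentence into a single set of word spans covering all granularities. It is not the authors' dataset-construction pipeline or any of their benchmark models. The function names (`to_spans`, `merge_segmentations`, `has_crossing`) and the example segmentations of 全国各地 are hypothetical and chosen only for illustration.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def to_spans(words: List[str]) -> Set[Span]:
    """Convert one single-grained segmentation into character spans."""
    spans: Set[Span] = set()
    start = 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def merge_segmentations(segmentations: List[List[str]]) -> List[Span]:
    """Union the word spans of several segmentations of the same sentence."""
    sentence = "".join(segmentations[0])
    merged: Set[Span] = set()
    for seg in segmentations:
        if "".join(seg) != sentence:
            raise ValueError("all segmentations must cover the same sentence")
        merged |= to_spans(seg)
    # Sort by start offset, longer (coarser) spans first.
    return sorted(merged, key=lambda s: (s[0], -s[1]))


def has_crossing(spans: List[Span]) -> bool:
    """True if two spans overlap without one containing the other."""
    return any(a < c < b < d for (a, b) in spans for (c, d) in spans)


# Hypothetical segmentations of one phrase at three granularities
# (made up for illustration, not taken from any dataset).
coarse = ["全国各地"]
medium = ["全国", "各地"]
fine = ["全", "国", "各地"]

sentence = "".join(coarse)
spans = merge_segmentations([coarse, medium, fine])
words = [sentence[i:j] for i, j in spans]
print(words)
print(has_crossing(spans))
```

A hierarchical (tree-shaped) view of the result, as in the constituent-parsing formulation mentioned above, only makes sense when no two spans cross, which is why the sketch includes `has_crossing` as a sanity check rather than assuming it.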

Anthology ID: D17-1072
Volume: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing
Month: September
Year: 2017
Address: Copenhagen, Denmark
Editors: Martha Palmer, Rebecca Hwa, Sebastian Riedel
Venue: EMNLP
SIG: SIGDAT
Publisher: Association for Computational Linguistics
Pages: 692–703
URL: https://aclanthology.org/D17-1072/
DOI: 10.18653/v1/D17-1072
Bibkey: gong-etal-2017-multi
Cite (ACL): Chen Gong, Zhenghua Li, Min Zhang, and Xinzhou Jiang. 2017. Multi-Grained Chinese Word Segmentation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 692–703, Copenhagen, Denmark. Association for Computational Linguistics.
Cite (Informal): Multi-Grained Chinese Word Segmentation (Gong et al., EMNLP 2017)
PDF: https://aclanthology.org/D17-1072.pdf

Export citation

BibTeX

@inproceedings{gong-etal-2017-multi,
    title = "Multi-Grained {C}hinese Word Segmentation",
    author = "Gong, Chen  and
      Li, Zhenghua  and
      Zhang, Min  and
      Jiang, Xinzhou",
    editor = "Palmer, Martha  and
      Hwa, Rebecca  and
      Riedel, Sebastian",
    booktitle = "Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing",
    month = sep,
    year = "2017",
    address = "Copenhagen, Denmark",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/D17-1072/",
    doi = "10.18653/v1/D17-1072",
    pages = "692--703",
    abstract = "Traditionally, word segmentation (WS) adopts the single-grained formalism, where a sentence corresponds to a single word sequence. However, Sproat et al. (1997) show that the inter-native-speaker consistency ratio over Chinese word boundaries is only 76{\%}, indicating single-grained WS (SWS) imposes unnecessary challenges on both manual annotation and statistical modeling. Moreover, WS results of different granularities can be complementary and beneficial for high-level applications. This work proposes and addresses multi-grained WS (MWS). We build a large-scale pseudo MWS dataset for model training and tuning by leveraging the annotation heterogeneity of three SWS datasets. Then we manually annotate 1,500 test sentences with true MWS annotations. Finally, we propose three benchmark approaches by casting MWS as constituent parsing and sequence labeling. Experiments and analysis lead to many interesting findings."
}

MODS XML

<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="gong-etal-2017-multi">
    <titleInfo>
        <title>Multi-Grained Chinese Word Segmentation</title>
    </titleInfo>
    <name type="personal">
        <namePart type="given">Chen</namePart>
        <namePart type="family">Gong</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Zhenghua</namePart>
        <namePart type="family">Li</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Min</namePart>
        <namePart type="family">Zhang</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Xinzhou</namePart>
        <namePart type="family">Jiang</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <originInfo>
        <dateIssued>2017-09</dateIssued>
    </originInfo>
    <typeOfResource>text</typeOfResource>
    <relatedItem type="host">
        <titleInfo>
            <title>Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing</title>
        </titleInfo>
        <name type="personal">
            <namePart type="given">Martha</namePart>
            <namePart type="family">Palmer</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Rebecca</namePart>
            <namePart type="family">Hwa</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Sebastian</namePart>
            <namePart type="family">Riedel</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <originInfo>
            <publisher>Association for Computational Linguistics</publisher>
            <place>
                <placeTerm type="text">Copenhagen, Denmark</placeTerm>
            </place>
        </originInfo>
        <genre authority="marcgt">conference publication</genre>
    </relatedItem>
    <abstract>Traditionally, word segmentation (WS) adopts the single-grained formalism, where a sentence corresponds to a single word sequence. However, Sproat et al. (1997) show that the inter-native-speaker consistency ratio over Chinese word boundaries is only 76%, indicating single-grained WS (SWS) imposes unnecessary challenges on both manual annotation and statistical modeling. Moreover, WS results of different granularities can be complementary and beneficial for high-level applications. This work proposes and addresses multi-grained WS (MWS). We build a large-scale pseudo MWS dataset for model training and tuning by leveraging the annotation heterogeneity of three SWS datasets. Then we manually annotate 1,500 test sentences with true MWS annotations. Finally, we propose three benchmark approaches by casting MWS as constituent parsing and sequence labeling. Experiments and analysis lead to many interesting findings.</abstract>
    <identifier type="citekey">gong-etal-2017-multi</identifier>
    <identifier type="doi">10.18653/v1/D17-1072</identifier>
    <location>
        <url>https://aclanthology.org/D17-1072/</url>
    </location>
    <part>
        <date>2017-09</date>
        <extent unit="page">
            <start>692</start>
            <end>703</end>
        </extent>
    </part>
</mods>
</modsCollection>

Endnote

%0 Conference Proceedings
%T Multi-Grained Chinese Word Segmentation
%A Gong, Chen
%A Li, Zhenghua
%A Zhang, Min
%A Jiang, Xinzhou
%Y Palmer, Martha
%Y Hwa, Rebecca
%Y Riedel, Sebastian
%S Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing
%D 2017
%8 September
%I Association for Computational Linguistics
%C Copenhagen, Denmark
%F gong-etal-2017-multi
%X Traditionally, word segmentation (WS) adopts the single-grained formalism, where a sentence corresponds to a single word sequence. However, Sproat et al. (1997) show that the inter-native-speaker consistency ratio over Chinese word boundaries is only 76%, indicating single-grained WS (SWS) imposes unnecessary challenges on both manual annotation and statistical modeling. Moreover, WS results of different granularities can be complementary and beneficial for high-level applications. This work proposes and addresses multi-grained WS (MWS). We build a large-scale pseudo MWS dataset for model training and tuning by leveraging the annotation heterogeneity of three SWS datasets. Then we manually annotate 1,500 test sentences with true MWS annotations. Finally, we propose three benchmark approaches by casting MWS as constituent parsing and sequence labeling. Experiments and analysis lead to many interesting findings.
%R 10.18653/v1/D17-1072
%U https://aclanthology.org/D17-1072/
%U https://doi.org/10.18653/v1/D17-1072
%P 692-703

Markdown (Informal)

[Multi-Grained Chinese Word Segmentation](https://aclanthology.org/D17-1072/) (Gong et al., EMNLP 2017)
