Attapol Rutherford


2022

Thai Nested Named Entity Recognition Corpus
Weerayut Buaphet | Can Udomcharoenchaikit | Peerat Limkonchotiwat | Attapol Rutherford | Sarana Nutanong
Findings of the Association for Computational Linguistics: ACL 2022

This paper presents the first Thai Nested Named Entity Recognition (N-NER) dataset. Thai N-NER consists of 264,798 mentions, 104 classes, and a maximum depth of 8 layers obtained from 4,894 documents in the domains of news articles and restaurant reviews. Our work, to the best of our knowledge, presents the largest non-English N-NER dataset and the first non-English one with fine-grained classes. To understand the new challenges our proposed dataset brings to the field, we conduct an experimental study on (i) cutting-edge N-NER models with state-of-the-art accuracy in English and (ii) baseline methods based on well-known language model architectures. From the experimental results, we obtained two key findings. First, all models produced poor F1 scores in the tail region of the class distribution. Second, these models provide little or no performance improvement over the baseline methods on our Thai dataset. These findings suggest that further investigation is required to build a multilingual N-NER solution that works well across different languages.

More Than Words: Collocation Retokenization for Latent Dirichlet Allocation Models
Jin Cheevaprawatdomrong | Alexandra Schofield | Attapol Rutherford
Findings of the Association for Computational Linguistics: ACL 2022

Traditionally, Latent Dirichlet Allocation (LDA) ingests words in a collection of documents to discover their latent topics using word-document co-occurrences. Previous studies show that representing bigram collocations in the input can improve topic coherence in English. However, it is unclear how to achieve the best results for languages without marked word boundaries such as Chinese and Thai. Here, we explore the use of retokenization based on chi-squared measures, t-statistics, and raw frequency to merge frequent token n-grams into collocations when preparing input to the LDA model. Based on the goodness of fit and the coherence metric, we show that topics trained with merged tokens result in topic keys that are clearer, more coherent, and more effective at distinguishing topics than those of unmerged models.
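
The chi-squared variant of the retokenization idea in this abstract can be pictured with a minimal sketch. This is not the authors' implementation: the function names `chi_squared_bigrams` and `retokenize`, the underscore joiner, and the restriction to bigrams are assumptions made for illustration, and threshold selection is left out.

```python
from collections import Counter


def chi_squared_bigrams(docs):
    """Score each adjacent token pair with Pearson's chi-squared statistic.

    `docs` is a list of token lists (already segmented). Hypothetical helper.
    """
    bigrams = Counter()
    for tokens in docs:
        bigrams.update(zip(tokens, tokens[1:]))

    n = sum(bigrams.values())          # total number of bigram positions
    left, right = Counter(), Counter() # marginal counts per position
    for (a, b), count in bigrams.items():
        left[a] += count
        right[b] += count

    scores = {}
    for (a, b), o11 in bigrams.items():
        o12 = left[a] - o11            # a followed by something other than b
        o21 = right[b] - o11           # b preceded by something other than a
        o22 = n - o11 - o12 - o21      # neither a first nor b second
        denom = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
        scores[(a, b)] = n * (o11 * o22 - o12 * o21) ** 2 / denom if denom else 0.0
    return scores


def retokenize(tokens, collocations, joiner="_"):
    """Merge adjacent pairs found in `collocations` into single tokens, left to right."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in collocations:
            merged.append(tokens[i] + joiner + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged
```

In such a sketch, the pairs whose score exceeds a chosen cutoff would form the `collocations` set, and the merged token lists would then be passed to the LDA model in place of the original tokens.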

2020

Syllable-based Neural Thai Word Segmentation
Pattarawat Chormai | Ponrawee Prasertsom | Jin Cheevaprawatdomrong | Attapol Rutherford
Proceedings of the 28th International Conference on Computational Linguistics

Word segmentation is a challenging pre-processing step for Thai Natural Language Processing due to the lack of explicit word boundaries. Previous systems rely on powerful neural network architectures alone and ignore the linguistic substructures of Thai words. We utilize the linguistic observation that Thai strings can be segmented into syllables, which should narrow down the search space for the word boundaries and provide helpful features. Here, we propose a neural Thai Word Segmenter that uses syllable embeddings to capture linguistic constraints and uses dilated CNN filters to capture the environment of each character. Toward this goal, we develop the first ML-based Thai orthographical syllable segmenter, which yields syllable embeddings to be used as features by the word segmenter. Our word segmentation system outperforms the previous state-of-the-art system in both speed and accuracy on both in-domain and out-of-domain datasets.

2019

Written on Leaves or in Stones?: Computational Evidence for the Era of Authorship of Old Thai Prose
Attapol Rutherford | Santhawat Thanyawong
Proceedings of the 1st International Workshop on Computational Approaches to Historical Language Change

We aim to provide computational evidence for the era of authorship of two important old Thai texts: Traiphumikatha and Pumratchatham. The era of authorship of these two books is still an ongoing debate among Thai literature scholars. Analysis of old Thai texts presents a challenge for standard natural language processing techniques, due to the lack of corpora necessary for building old Thai word and syllable segmentation. We propose an accurate and interpretable model to classify each segment as one of the three eras of authorship (Sukhothai, Ayuddhya, or Rattanakosin) without sophisticated linguistic preprocessing. Contrary to previous hypotheses, our model suggests that both books were written during the Sukhothai era. Moreover, the second half of the Pumratchatham is uncharacteristic of the Sukhothai era, which may have confounded literary scholars in the past. Further, our model reveals that the most indicative linguistic changes stem from unidirectional grammaticalized words and polyfunctional words, which show up as the most dominant features in the model.

2017

A Systematic Study of Neural Discourse Models for Implicit Discourse Relation
Attapol Rutherford | Vera Demberg | Nianwen Xue
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

Inferring implicit discourse relations in natural language text is the most difficult subtask in discourse parsing. Many neural network models have been proposed to tackle this problem. However, comparisons for this task are not unified, so it is hard to draw clear conclusions about the effectiveness of various architectures. Here, we propose neural network models that are based on feedforward and long short-term memory (LSTM) architectures and systematically study the effects of varying structures. To our surprise, the best-configured feedforward architecture outperforms the LSTM-based model in most cases despite thorough tuning. Further, we compare our best feedforward system with competitive convolutional and recurrent networks and find that feedforward can actually be more effective. For the first time for this task, we compile and publish outputs from previous neural and non-neural systems to establish the standard for further comparison.

2016

CoNLL 2016 Shared Task on Multilingual Shallow Discourse Parsing
Nianwen Xue | Hwee Tou Ng | Sameer Pradhan | Attapol Rutherford | Bonnie Webber | Chuan Wang | Hongmin Wang
Proceedings of the CoNLL-16 shared task

Robust Non-Explicit Neural Discourse Parser in English and Chinese
Attapol Rutherford | Nianwen Xue
Proceedings of the CoNLL-16 shared task

2015

Improving the Inference of Implicit Discourse Relations via Classifying Explicit Discourse Connectives
Attapol Rutherford | Nianwen Xue
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

The CoNLL-2015 Shared Task on Shallow Discourse Parsing
Nianwen Xue | Hwee Tou Ng | Sameer Pradhan | Rashmi Prasad | Christopher Bryant | Attapol Rutherford
Proceedings of the Nineteenth Conference on Computational Natural Language Learning - Shared Task

2014

Discovering Implicit Discourse Relations Through Brown Cluster Pair Representation and Coreference Patterns
Attapol Rutherford | Nianwen Xue
Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics