Satoshi Sekine


2024

pdf bib
Should We Respect LLMs? A Cross-Lingual Study on the Influence of Prompt Politeness on LLM Performance
Ziqi Yin | Hao Wang | Kaito Horio | Daisuke Kawahara | Satoshi Sekine
Proceedings of the Second Workshop on Social Influence in Conversations (SICon 2024)

We investigate the impact of politeness levels in prompts on the performance of large language models (LLMs). Polite language in human communication often garners more compliance and effectiveness, while rudeness can cause aversion, impacting response quality. We consider that LLMs mirror human communication traits, suggesting they align with human cultural norms. We assess the impact of politeness in prompts on LLMs across English, Chinese, and Japanese tasks. We observed that impolite prompts often result in poor performance, but overly polite language does not guarantee better outcomes. The best politeness level differs by language. This phenomenon suggests that LLMs not only reflect human behavior but are also influenced by language, particularly across cultural contexts. Our findings highlight the need to factor politeness into cross-cultural natural language processing and LLM usage.

pdf bib
Analysis of LLM’s “Spurious” Correct Answers Using Evidence Information of Multi-hop QA Datasets
Ai Ishii | Naoya Inoue | Hisami Suzuki | Satoshi Sekine
Proceedings of the 1st Workshop on Knowledge Graphs and Large Language Models (KaLLM 2024)

Recent LLMs show impressive accuracy on one of the hallmark tasks of language understanding, namely Question Answering (QA). However, it is not clear whether the correct answers provided by LLMs are actually grounded in the correct knowledge related to the question. In this paper, we use multi-hop QA datasets to evaluate the accuracy of the knowledge LLMs use to answer questions, and show that as much as 31% of the correct answers by the LLMs are in fact spurious, i.e., the answer is correct but the knowledge the LLM used to ground it is wrong. We present an analysis of these spurious correct answers by GPT-4 using three datasets in two languages, and suggest future pathways for correcting the grounding information using existing external knowledge bases.

pdf bib
JEMHopQA: Dataset for Japanese Explainable Multi-Hop Question Answering
Ai Ishii | Naoya Inoue | Hisami Suzuki | Satoshi Sekine
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

We present JEMHopQA, a multi-hop QA dataset for the development of explainable QA systems. The dataset consists not only of question-answer pairs, but also of supporting evidence in the form of derivation triples, which contributes to making the QA task more realistic and difficult. It is created based on Japanese Wikipedia using both crowd-sourced human annotation as well as prompting a large language model (LLM), and contains a diverse set of question, answer and topic categories as compared with similar datasets released previously. We describe the details of how we built the dataset as well as the evaluation of the QA task presented by this dataset using GPT-4, and show that the dataset is sufficiently challenging for the state-of-the-art LLM while showing promise for combining such a model with existing knowledge resources to achieve better performance.
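
A derivation triple records one hop of the supporting evidence as (subject, relation, value). As a purely illustrative sketch of what one record could look like (the field names, values, and schema below are assumptions, not the dataset's actual format):

    # Hypothetical shape of one JEMHopQA-style record: a question, its
    # answer, and the derivation triples that justify the answer.
    example = {
        "question": "Which was founded earlier, the University of Tokyo or Kyoto University?",
        "answer": "the University of Tokyo",
        "derivations": [
            ("University of Tokyo", "founded", "1877"),
            ("Kyoto University", "founded", "1897"),
        ],
    }

A QA system is then judged not only on the answer string but also on whether its grounding can be matched against such triples.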

2023

pdf bib
What is the Real Intention behind this Question? Dataset Collection and Intention Classification
Maryam Sadat Mirzaei | Kourosh Meshgi | Satoshi Sekine
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Asking and answering questions are inseparable parts of human social life. The primary purposes of asking questions are to gain knowledge or request help, which has been the subject of question-answering studies. However, questions can also reflect negative intentions and include implicit offenses, such as highlighting one’s lack of knowledge or bolstering an alleged superior knowledge, which can lead to conflict in conversations, yet this has scarcely been researched. This paper is the first study to introduce a dataset (Question Intention Dataset) that includes questions with positive/neutral and negative intentions and the underlying intention categories within each group. We further conduct a meta-analysis to highlight tacit and apparent intents. We also propose a classification method using Transformers augmented by TF-IDF-based features and report the results of several models for classifying the main intention categories. We aim to highlight the importance of taking intentions into account, especially implicit and negative ones, to gain insight into conflict-evoking questions and better understand human-human communication on the web for NLP applications.
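
To illustrate the kind of feature combination the classification method describes (Transformer representations augmented by TF-IDF features), here is a minimal sketch; the encoder name, toy data, and linear classifier are placeholder assumptions, not the paper's actual architecture or hyperparameters:

    # Sketch only: concatenate generic sentence-transformer embeddings with
    # TF-IDF vectors and fit a linear classifier on toy intention labels.
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sentence_transformers import SentenceTransformer

    questions = ["Could you explain how this works?",              # neutral
                 "Do you even know what you are talking about?"]   # negative
    labels = [0, 1]

    encoder = SentenceTransformer("all-MiniLM-L6-v2")   # placeholder model
    dense = encoder.encode(questions)                    # (n, d) embeddings
    tfidf = TfidfVectorizer(ngram_range=(1, 2))
    sparse = tfidf.fit_transform(questions).toarray()    # (n, v) TF-IDF

    features = np.hstack([dense, sparse])                # simple fusion
    clf = LogisticRegression(max_iter=1000).fit(features, labels)
    print(clf.predict(features))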

2022

pdf bib
Q-Learning Scheduler for Multi Task Learning Through the use of Histogram of Task Uncertainty
Kourosh Meshgi | Maryam Sadat Mirzaei | Satoshi Sekine
Proceedings of the 7th Workshop on Representation Learning for NLP

Simultaneous training of a multi-task learning network on different domains or tasks is not always straightforward. It can lead to inferior performance or generalization compared to the corresponding single-task networks. An effective training scheduling method is therefore needed to maximize the benefits of multi-task learning. Traditional schedulers follow a heuristic or prefixed strategy, ignoring the relations among the tasks, their sample complexities, and the state of the emergent shared features. We propose a deep Q-Learning Scheduler (QLS) that monitors the state of the tasks and the shared features using a novel histogram of task uncertainty and, through trial and error, learns an optimal policy for task scheduling. We conducted extensive experiments on multi-domain and multi-task settings with various task difficulty profiles, benchmarked the proposed method against other schedulers, and demonstrated its superior performance.
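
As a rough illustration of the control loop such a scheduler runs (the paper uses a deep Q-network over a histogram-of-uncertainty state; the tabular toy below only sketches the idea, with a made-up environment standing in for real training):

    # Tabular toy of a Q-learning task scheduler: the state is a coarse
    # discretization of per-task uncertainty, the action picks the next
    # task, and the reward is the resulting improvement.
    import random
    from collections import defaultdict

    NUM_TASKS, EPS, ALPHA, GAMMA = 3, 0.1, 0.5, 0.9
    Q = defaultdict(float)                     # Q[(state, task)] -> value

    def state_of(uncertainties, bins=3):
        return tuple(min(int(u * bins), bins - 1) for u in uncertainties)

    def choose_task(state):
        if random.random() < EPS:              # epsilon-greedy exploration
            return random.randrange(NUM_TASKS)
        return max(range(NUM_TASKS), key=lambda a: Q[(state, a)])

    uncertainties = [random.random() for _ in range(NUM_TASKS)]
    for step in range(1000):
        s = state_of(uncertainties)
        a = choose_task(s)
        # Toy stand-in for training on task `a`: uncertainty drops and the
        # drop is the reward (a real system would use validation metrics).
        reward = uncertainties[a] * 0.1
        uncertainties[a] -= reward
        s2 = state_of(uncertainties)
        best_next = max(Q[(s2, b)] for b in range(NUM_TASKS))
        Q[(s, a)] += ALPHA * (reward + GAMMA * best_next - Q[(s, a)])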

pdf bib
Uncertainty Regularized Multi-Task Learning
Kourosh Meshgi | Maryam Sadat Mirzaei | Satoshi Sekine
Proceedings of the 12th Workshop on Computational Approaches to Subjectivity, Sentiment & Social Media Analysis

By sharing parameters and providing task-independent shared features, multi-task deep neural networks are considered one of the most interesting approaches to parallel learning from different tasks and domains. However, fine-tuning on one task may compromise the performance of other tasks or restrict the generalization of the shared learned features. To address this issue, we propose to use task uncertainty to gauge the effect of shared-feature changes on other tasks and to prevent the model from overfitting or over-generalizing. We conducted an experiment on 16 text classification tasks, and the findings showed that the proposed method consistently improves the performance of the baseline, facilitates the transfer of learned features to unseen data, and provides explicit control over the generalization of the shared model.
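
The paper's regularizer is not reproduced here, but a closely related and widely used construction makes the idea concrete: weighting each task's loss by a learned uncertainty term, in the style of Kendall et al. (2018). The sketch below is that related technique under toy losses, not this paper's exact method:

    # Uncertainty-weighted multi-task loss (Kendall-style illustration):
    # each task has a learned log-variance; high-uncertainty tasks are
    # down-weighted, and the log-variance term penalizes ignoring tasks.
    import torch
    import torch.nn as nn

    class UncertaintyWeightedLoss(nn.Module):
        def __init__(self, num_tasks):
            super().__init__()
            self.log_vars = nn.Parameter(torch.zeros(num_tasks))

        def forward(self, task_losses):
            total = 0.0
            for i, loss in enumerate(task_losses):
                precision = torch.exp(-self.log_vars[i])
                total = total + precision * loss + self.log_vars[i]
            return total

    weigher = UncertaintyWeightedLoss(num_tasks=2)
    print(weigher([torch.tensor(0.8), torch.tensor(1.5)]))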

pdf bib
Resource of Wikipedias in 31 Languages Categorized into Fine-Grained Named Entities
Satoshi Sekine | Kouta Nakayama | Masako Nomoto | Maya Ando | Asuka Sumida | Koji Matsuda
Proceedings of the 29th International Conference on Computational Linguistics

This paper describes a resource of Wikipedias in 31 languages categorized into Extended Named Entity (ENE), which has 219 fine-grained NE categories. We first categorized 920K Japanese Wikipedia pages according to the ENE scheme using machine learning, followed by manual validation. We then organized a shared task of Wikipedia categorization into 30 languages. The training data were provided by the Japanese categorization and the language links, and the task was to categorize the Wikipedia pages in 30 languages that have no language links from Japanese Wikipedia (20M pages in total). Thirteen groups with 24 systems participated in the 2020 and 2021 tasks, sharing their outputs for resource building. The Japanese categorization accuracy was 98.5%, and the best performance among the 30 languages ranged from 80 to 93 in F-measure. Using ensemble learning, we created outputs with an average F-measure of 86.8, which is 1.7 better than the best single systems. The total size of the resource is 32.5M pages, including the training data. We call this resource creation scheme “Resource by Collaborative Contribution (RbCC)”. We also constructed structuring tasks (attribute extraction and link prediction) using RbCC under our ongoing project, “SHINRA.”
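
A minimal sketch of the ensembling step that RbCC enables, reusing every participant's output for a page; majority voting is shown for concreteness (the project's actual ensemble learning was more involved), and the labels below are hypothetical:

    # Toy voter over shared-task outputs: each participant system assigns
    # an ENE category to a Wikipedia page; the ensemble takes the majority.
    from collections import Counter

    def ensemble_label(system_outputs):
        label, _ = Counter(system_outputs).most_common(1)[0]
        return label

    page_predictions = ["City", "City", "Company"]   # hypothetical outputs
    print(ensemble_label(page_predictions))          # -> "City"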

pdf bib
Iterative Span Selection: Self-Emergence of Resolving Orders in Semantic Role Labeling
Shuhei Kurita | Hiroki Ouchi | Kentaro Inui | Satoshi Sekine
Proceedings of the 29th International Conference on Computational Linguistics

Semantic Role Labeling (SRL) is the task of labeling semantic arguments for marked semantic predicates. Semantic arguments and their predicates are related in various distinct manners: certain semantic arguments are a necessity, while others serve as an auxiliary to their predicates. To account for such roles and relations of the arguments in the labeling order, we introduce iterative argument identification (IAI), which combines global decoding and iterative identification for the semantic arguments. In experiments, we first find that a model with random argument labeling orders outperforms heuristic orders such as the conventional left-to-right labeling order. Combined with simple reinforcement learning, the proposed model spontaneously learns optimized labeling orders that differ from existing heuristic orders. The proposed model with the IAI algorithm achieves results competitive with, or better than, existing models on the standard benchmark datasets of span-based SRL: CoNLL-2005 and CoNLL-2012.

2021

pdf bib
Co-Teaching Student-Model through Submission Results of Shared Task
Kouta Nakayama | Shuhei Kurita | Akio Kobayashi | Yukino Baba | Satoshi Sekine
Findings of the Association for Computational Linguistics: EMNLP 2021

Shared tasks have a long history and have become a mainstream of NLP research. Most shared tasks require participants to submit only system outputs and descriptions. It is uncommon for a shared task to request submission of the system itself because of license issues and implementation differences. Therefore, many systems are abandoned without being used in real applications or contributing to better systems. In this research, we propose a scheme to utilize all the systems that participated in a shared task. In this scheme, we use all participating systems’ outputs as teachers and develop a new model as a student that learns the characteristics of each system. We call this scheme “Co-Teaching.” It creates a unified system that performs better than the task’s single best system, requires only the system outputs, and demands little extra effort from the participants and organizers. We apply this scheme to the “SHINRA2019-JP” shared task, which had nine participants with various output accuracies, and confirm that the unified system outperforms the best system. The code used in our experiments has also been released.
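
A minimal sketch of the student-training idea, assuming nothing about the actual architecture: the participating systems' hard labels are pooled into soft targets, and a toy student is trained against them (all data below is fabricated for illustration):

    # Distill multiple shared-task outputs into one student model.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    NUM_CLASSES, DIM, N = 3, 8, 4
    votes = torch.tensor([[0, 0, 1],        # rows: examples,
                          [2, 2, 2],        # columns: participant systems
                          [1, 0, 1],
                          [2, 1, 2]])
    soft = F.one_hot(votes, NUM_CLASSES).float().mean(dim=1)  # soft targets

    x = torch.randn(N, DIM)                 # toy input features
    student = nn.Linear(DIM, NUM_CLASSES)
    opt = torch.optim.Adam(student.parameters(), lr=0.1)
    for _ in range(100):
        logits = student(x)
        loss = -(soft * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()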

pdf bib
A Case Study of In-House Competition for Ranking Constructive Comments in a News Service
Hayato Kobayashi | Hiroaki Taguchi | Yoshimune Tabuchi | Chahine Koleejan | Ken Kobayashi | Soichiro Fujita | Kazuma Murao | Takeshi Masuyama | Taichi Yatsuka | Manabu Okumura | Satoshi Sekine
Proceedings of the Ninth International Workshop on Natural Language Processing for Social Media

Ranking the user comments posted on a news article is important for online news services because comment visibility directly affects the user experience. Research on ranking comments with different metrics for measuring comment quality has shown that “constructiveness,” as used in argument analysis, is promising from a practical standpoint. In this paper, we report a case study in which this constructiveness is examined in the real world. Specifically, we examine an in-house competition to improve the performance of ranking constructive comments and demonstrate the effectiveness of the best obtained model for a commercial service.

2020

pdf bib
Multi-class Multilingual Classification of Wikipedia Articles Using Extended Named Entity Tag Set
Hassan S. Shavarani | Satoshi Sekine
Proceedings of the Twelfth Language Resources and Evaluation Conference

Wikipedia is a great source of general world knowledge which can help NLP models better understand the motivation behind their predictions. Structuring Wikipedia is the initial step towards this goal, as it facilitates fine-grained classification of articles. In this work, we introduce the Shinra 5-Language Categorization Dataset (SHINRA-5LDS), a large multi-lingual and multi-labeled set of Wikipedia articles in Japanese, English, French, German, and Farsi, annotated using the Extended Named Entity (ENE) tag set. We evaluate the dataset using the best models provided for ENE label set classification and show that the currently available classification models struggle with large datasets using fine-grained tag sets.

2019

pdf bib
HELP: A Dataset for Identifying Shortcomings of Neural Models in Monotonicity Reasoning
Hitomi Yanaka | Koji Mineshima | Daisuke Bekki | Kentaro Inui | Satoshi Sekine | Lasha Abzianidze | Johan Bos
Proceedings of the Eighth Joint Conference on Lexical and Computational Semantics (*SEM 2019)

Large crowdsourced datasets are widely used for training and evaluating neural models on natural language inference (NLI). Despite these efforts, neural models have a hard time capturing logical inferences, including those licensed by phrase replacements, so-called monotonicity reasoning. Since no large dataset has been developed for monotonicity reasoning, it is still unclear whether the main obstacle is the size of datasets or the model architectures themselves. To investigate this issue, we introduce a new dataset, called HELP, for handling entailments with lexical and logical phenomena. We add it to training data for the state-of-the-art neural models and evaluate them on test sets for monotonicity phenomena. The results showed that our data augmentation improved the overall accuracy. We also find that the improvement is better on monotonicity inferences with lexical replacements than on downward inferences with disjunction and modification. This suggests that some types of inferences can be improved by our data augmentation while others are immune to it.

pdf bib
Select and Attend: Towards Controllable Content Selection in Text Generation
Xiaoyu Shen | Jun Suzuki | Kentaro Inui | Hui Su | Dietrich Klakow | Satoshi Sekine
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Many text generation tasks naturally contain two steps: content selection and surface realization. Current neural encoder-decoder models conflate both steps into a black-box architecture. As a result, the content to be described in the text cannot be explicitly controlled. This paper tackles this problem by decoupling content selection from the decoder. The decoupled content selection is human interpretable, and its value can be manually manipulated to control the content of the generated text. The model can be trained end-to-end without human annotations by maximizing a lower bound of the marginal likelihood. We further propose an effective way to trade off performance against controllability with a single adjustable hyperparameter. In both data-to-text and headline generation tasks, our model achieves promising results, paving the way for controllable content selection in text generation.
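
The training objective mentioned above, maximizing a lower bound of the marginal likelihood, has a standard variational form. With x the input, y the text, m the latent content selection, and q an approximate posterior, one such bound (a generic sketch, not necessarily the paper's exact derivation) is

    \log p(y \mid x) = \log \sum_m p(m \mid x)\, p(y \mid x, m)
                     \ge \mathbb{E}_{q(m)}\big[\log p(y \mid x, m)\big] - \mathrm{KL}\big(q(m) \,\|\, p(m \mid x)\big)

Maximizing the right-hand side trains the selector and the decoder jointly without annotated selections.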

pdf bib
Bridging the Defined and the Defining: Exploiting Implicit Lexical Semantic Relations in Definition Modeling
Koki Washio | Satoshi Sekine | Tsuneaki Kato
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Definition modeling includes acquiring word embeddings from dictionary definitions and generating definitions of words. While the meanings of defining words are important in dictionary definitions, it is crucial to capture the lexical semantic relations between defined words and defining words. However, thus far, the utilization of such relations has not been explored for definition modeling. In this paper, we propose definition modeling methods that use lexical semantic relations. To utilize implicit semantic relations in definitions, we use pattern-based word-pair embeddings, obtained without supervision, that represent the semantic relations of word pairs. Experimental results indicate that our methods improve performance both in learning embeddings from definitions and in definition generation.

pdf bib
Analytic Score Prediction and Justification Identification in Automated Short Answer Scoring
Tomoya Mizumoto | Hiroki Ouchi | Yoriko Isobe | Paul Reisert | Ryo Nagata | Satoshi Sekine | Kentaro Inui
Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications

This paper provides an analytical assessment of student short answer responses with a view to potential benefits in pedagogical contexts. We first propose and formalize two novel analytical assessment tasks: analytic score prediction and justification identification, and then provide the first dataset created for analytic short answer scoring research. Subsequently, we present a neural baseline model and report our extensive empirical results to demonstrate how our dataset can be used to explore new and intriguing technical challenges in short answer scoring. The dataset is publicly available for research purposes.

pdf bib
Can Neural Networks Understand Monotonicity Reasoning?
Hitomi Yanaka | Koji Mineshima | Daisuke Bekki | Kentaro Inui | Satoshi Sekine | Lasha Abzianidze | Johan Bos
Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP

Monotonicity reasoning is one of the important reasoning skills for any intelligent natural language inference (NLI) model in that it requires the ability to capture the interaction between lexical and syntactic structures. Since no test set has been developed for monotonicity reasoning with wide coverage, it is still unclear whether neural models can perform monotonicity reasoning in a proper way. To investigate this issue, we introduce the Monotonicity Entailment Dataset (MED). Performance by state-of-the-art NLI models on the new test set is substantially worse, under 55%, especially on downward reasoning. In addition, analysis using a monotonicity-driven data augmentation method showed that these models might be limited in their generalization ability in upward and downward reasoning.

2018

pdf bib
An Empirical Study on Fine-Grained Named Entity Recognition
Khai Mai | Thai-Hoang Pham | Minh Trung Nguyen | Tuan Duc Nguyen | Danushka Bollegala | Ryohei Sasano | Satoshi Sekine
Proceedings of the 27th International Conference on Computational Linguistics

Named entity recognition (NER) has attracted a substantial amount of research. Recently, several neural network-based models have been proposed and have achieved high performance. However, there is little research on fine-grained NER (FG-NER), in which hundreds of named entity categories must be recognized, especially for non-English languages. It is still an open question whether there is a model that is robust across various settings or whether the proper model varies depending on the language, the number of named entity categories, and the size of training datasets. This paper first presents an empirical comparison of FG-NER models for English and Japanese and demonstrates that LSTM+CNN+CRF (Ma and Hovy, 2016), one of the state-of-the-art methods for English NER, also works well for English FG-NER but does not work well for Japanese, a language that has a large number of character types. To tackle this problem, we propose a method to improve neural network-based Japanese FG-NER performance by removing the CNN layer and utilizing dictionary and category embeddings. Experimental results show that the proposed method improves the Japanese FG-NER F-score from 66.76% to 75.18%.

pdf bib
What Makes Reading Comprehension Questions Easier?
Saku Sugawara | Kentaro Inui | Satoshi Sekine | Akiko Aizawa
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

A challenge in creating a dataset for machine reading comprehension (MRC) is to collect questions that require a sophisticated understanding of language to answer, beyond using superficial cues. In this work, we investigate what makes questions easier across 12 recent MRC datasets with three question styles (answer extraction, description, and multiple choice). We propose to employ simple heuristics to split each dataset into easy and hard subsets and examine the performance of two baseline models on each subset. We then manually annotate questions sampled from each subset with both validity and requisite reasoning skills to investigate which skills explain the difference between easy and hard questions. From this study, we observe that (i) the baseline performance on the hard subsets degrades remarkably compared to that on the entire datasets, (ii) hard questions require knowledge inference and multiple-sentence reasoning compared with easy questions, and (iii) multiple-choice questions tend to require a broader range of reasoning skills than answer extraction and description questions. These results suggest that one might overestimate recent advances in MRC.

2017

pdf bib
Extended Named Entity Recognition API and Its Applications in Language Education
Tuan Duc Nguyen | Khai Mai | Thai-Hoang Pham | Minh Trung Nguyen | Truc-Vien T. Nguyen | Takashi Eguchi | Ryohei Sasano | Satoshi Sekine
Proceedings of ACL 2017, System Demonstrations

2016

pdf bib
Neural Joint Learning for Classifying Wikipedia Articles into Fine-grained Named Entity Types
Masatoshi Suzuki | Koji Matsuda | Satoshi Sekine | Naoaki Okazaki | Kentaro Inui
Proceedings of the 30th Pacific Asia Conference on Language, Information and Computation: Posters

pdf bib
Name Variation in Community Question Answering Systems
Anietie Andy | Satoshi Sekine | Mugizi Rwebangira | Mark Dredze
Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT)

Community question answering systems are forums where users can ask and answer questions in various categories. Examples are Yahoo! Answers, Quora, and Stack Overflow. A common challenge with such systems is that a significant percentage of asked questions are left unanswered. In this paper, we propose an algorithm to reduce the number of unanswered questions in Yahoo! Answers by reusing the answer to the past resolved question on the site that is most similar to the unanswered question. Semantically similar questions can be worded differently, making it difficult to find questions with shared needs. For example, “Who is the best player for the Reds?” and “Who is currently the biggest star at Manchester United?” have a shared need but are worded differently; also, both “Reds” and “Manchester United” are used to refer to the soccer team Manchester United football club. In this research, we focus on question categories that contain a large number of named entities and entity name variations. We show that in these categories, entity linking can be used to identify relevant past resolved questions that share needs with a given question by disambiguating named entities and matching questions based on the disambiguated entities, identified entities, and knowledge base information related to these entities. We evaluated our algorithm on a new dataset constructed from Yahoo! Answers. The dataset contains annotated question pairs, (Qgiven, [Qpast, Answer]). We carried out experiments on several question categories and show that an entity-based approach gives good performance when searching for similar questions in entity-rich categories.

pdf bib
An Entity-Based approach to Answering Recurrent and Non-Recurrent Questions with Past Answers
Anietie Andy | Mugizi Rwebangira | Satoshi Sekine
Proceedings of the Open Knowledge Base and Question Answering Workshop (OKBQA 2016)

Community question answering (CQA) systems such as Yahoo! Answers allow registered users to ask and answer questions in various question categories. However, a significant percentage of asked questions in Yahoo! Answers are unanswered. In this paper, we propose to reduce this percentage by reusing answers to past resolved questions from the site. Specifically, we propose to satisfy unanswered questions in entity-rich categories by searching for and reusing the best answers to past resolved questions with shared needs. For unanswered questions that do not have a past resolved question with a shared need, we propose to use the best answer to a past resolved question with similar needs. Our experiments on a Yahoo! Answers dataset show that our approach retrieves most of the past resolved questions that have shared and similar needs to unanswered questions.

2015

pdf bib
Utilizing review analysis to suggest product advertisement improvements
Takaaki Tsunoda | Takashi Inui | Satoshi Sekine
Proceedings of the 6th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

2014

pdf bib
Lightweight Client-Side Chinese/Japanese Morphological Analyzer Based on Online Learning
Masato Hagiwara | Satoshi Sekine
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: System Demonstrations

2013

pdf bib
Accurate Word Segmentation using Transliteration and Language Model Projection
Masato Hagiwara | Satoshi Sekine
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
Unsupervised Extraction of Attributes and Their Values from Product Description
Keiji Shinzato | Satoshi Sekine
Proceedings of the Sixth International Joint Conference on Natural Language Processing

2012

pdf bib
SurfShop: combining a product ontology with topic model results for online window-shopping
Zofia Stankiewicz | Satoshi Sekine
Proceedings of the Demonstration Session at the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib
Latent Semantic Transliteration using Dirichlet Mixture
Masato Hagiwara | Satoshi Sekine
Proceedings of the 4th Named Entity Workshop (NEWS) 2012

pdf bib
phloat: Integrated Writing Environment for ESL learners
Yuta Hayashibe | Masato Hagiwara | Satoshi Sekine
Proceedings of the Second Workshop on Advances in Text Input Methods

2011

pdf bib
Semi-supervised Relation Extraction with Large-scale Word Clustering
Ang Sun | Ralph Grishman | Satoshi Sekine
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

pdf bib
Latent Class Transliteration based on Source Language Origin
Masato Hagiwara | Satoshi Sekine
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

2010

pdf bib
Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information
Satoshi Sekine | Kapil Dalwani
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

We developed a search tool for ngrams extracted from a very large corpus (the current system uses the entire Wikipedia, which has 1.7 billion tokens). The tool supports queries with an arbitrary number of wildcards and/or specification by a combination of token, POS, chunk (such as NP, VP, PP) and Named Entity (NE). The previous system (Sekine 08) could only handle tokens and unrestricted wildcards in the query, such as “* was established in *”. However, being able to constrain the wildcards by POS, chunk or NE is quite useful for filtering out noise. For example, the new system can search for “NE=COMPANY was established in POS=CD”. This finer specification reduces the number of outputs to less than half and avoids ngrams which have a comma or a common noun at the first position or location information at the last position. The tool outputs the matched ngrams with their frequencies as well as all the contexts (i.e. sentences, KWIC lists and document ID information) where the matched ngrams occur in the corpus. A search takes a fraction of a second on a single-CPU Linux PC (1GB memory and 500GB disk).
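
The query language is easy to picture with a toy matcher; the real tool answers such queries from an index over the full corpus, whereas this sketch (with invented annotations) just scans a list:

    # Toy matcher for constrained wildcard queries such as
    # "NE=COMPANY was established in POS=CD". Tokens are
    # (surface, POS, NE) triples; the annotations below are invented.
    def token_matches(pattern, tok):
        surface, pos, ne = tok
        if pattern == "*":
            return True
        if pattern.startswith("POS="):
            return pos == pattern[4:]
        if pattern.startswith("NE="):
            return ne == pattern[3:]
        return surface == pattern

    def ngram_matches(query, ngram):
        return len(query) == len(ngram) and all(
            token_matches(q, t) for q, t in zip(query, ngram))

    corpus_ngrams = [
        [("Google", "NNP", "COMPANY"), ("was", "VBD", "O"),
         ("established", "VBN", "O"), ("in", "IN", "O"),
         ("1998", "CD", "DATE")],
    ]
    query = ["NE=COMPANY", "was", "established", "in", "POS=CD"]
    print([ng for ng in corpus_ngrams if ngram_matches(query, ng)])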

pdf bib
New Tools for Web-Scale N-grams
Dekang Lin | Kenneth Church | Heng Ji | Satoshi Sekine | David Yarowsky | Shane Bergsma | Kailash Patil | Emily Pitler | Rachel Lathbury | Vikram Rao | Kapil Dalwani | Sushant Narsale
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

While the web provides a fantastic linguistic resource, collecting and processing data at web-scale is beyond the reach of most academic laboratories. Previous research has relied on search engines to collect online information, but this is hopelessly inefficient for building large-scale linguistic resources, such as lists of named-entity types or clusters of distributionally similar words. An alternative to processing web-scale text directly is to use the information provided in an N-gram corpus. An N-gram corpus is an efficient compression of large amounts of text. An N-gram corpus states how often each sequence of words (up to length N) occurs. We propose tools for working with enhanced web-scale N-gram corpora that include richer levels of source annotation, such as part-of-speech tags. We describe a new set of search tools that make use of these tags, and collectively lower the barrier for lexical learning and ambiguity resolution at web-scale. They will allow novel sources of information to be applied to long-standing natural language challenges.

2008

pdf bib
Extended Named Entity Ontology with Attribute Information
Satoshi Sekine
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Named Entities (NE) are regarded as an important type of semantic knowledge in many natural language processing (NLP) applications. Originally, a limited number of NE categories were proposed. In MUC, there were seven categories: people, organization, location, time, date, money and percentage expressions. However, it was noticed that such a limited number of NE categories is too small for many applications. The author has proposed Extended Named Entity (ENE), which has about 200 categories (Sekine and Nobata 04). During the development of ENE, we noticed that many ENE categories have specific attributes, and these provide very important information for the entities. For example, “rivers” have attributes like “source location”, “outflow”, and “length”. Some such information is essential to “knowing about” the river, whereas the name is only a label used to refer to the river. Such attributes are also important information for many NLP applications. In this paper, we report on the design of a set of attributes for ENE categories. We used a bottom-up approach to creating the knowledge, using a Japanese encyclopedia which contains abundant descriptions of ENE instances.

pdf bib
Sentiment Analysis Based on Probabilistic Models Using Inter-Sentence Information
Kugatsu Sadamitsu | Satoshi Sekine | Mikio Yamamoto
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

This paper proposes a new method of sentiment analysis that utilizes inter-sentence structures, especially for coping with the reversal of word polarity, such as quoting others’ opinions from an opposite standpoint. We model these phenomena using Hidden Conditional Random Fields (HCRFs) with three kinds of features: transition features, polarity features and reversal (of polarity) features. Polarity features and reversal features are doubly added to each word, and the weights of the features are trained on the common structure of the positive and negative corpora, for example under the assumption that the reversal phenomenon occurs for the same reasons (features) in both polarity corpora. Our method achieved better accuracy than the Naive Bayes method and accuracy as good as that of SVMs.
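
The polarity-reversal intuition can be made concrete with a deliberately tiny toy: each word carries a polarity weight, a latent per-word state may flip it (as when quoting an opposing opinion), and exhaustive enumeration stands in for HCRF inference. The weights and penalty below are invented, and nothing here reproduces the paper's trained model:

    # Toy latent-reversal scorer: pick the hidden reversal sequence that
    # maximizes the sentence score; a reversed word's polarity is flipped.
    from itertools import product

    polarity = {"terrible": -1.0, "they": 0.0, "say": 0.0,
                "but": 0.0, "great": 1.0}
    REVERSAL_COST = 0.5          # invented penalty for reversing a word

    def score(words, hidden):    # hidden[i] = 1 flips word i's polarity
        s = 0.0
        for w, h in zip(words, hidden):
            p = polarity.get(w, 0.0)
            s += (-p if h else p) - REVERSAL_COST * h
        return s

    words = "terrible they say but great".split()
    best = max(product([0, 1], repeat=len(words)),
               key=lambda h: score(words, h))
    print(best, score(words, best))   # reverses "terrible", net positive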

pdf bib
A Linguistic Knowledge Discovery Tool: Very Large Ngram Database Search with Arbitrary Wildcards
Satoshi Sekine
Coling 2008: Companion volume: Demonstrations

2007

pdf bib
The SemEval-2007 WePS Evaluation: Establishing a benchmark for the Web People Search Task
Javier Artiles | Julio Gonzalo | Satoshi Sekine
Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007)

pdf bib
System Demonstration of On-Demand Information Extraction
Satoshi Sekine | Akira Oda
Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions

pdf bib
Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing
Satoshi Sekine | Kentaro Inui | Ido Dagan | Bill Dolan | Danilo Giampiccolo | Bernardo Magnini
Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing

2006

pdf bib
On-Demand Information Extraction
Satoshi Sekine
Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions

pdf bib
Preemptive Information Extraction using Unrestricted Relation Discovery
Yusuke Shinyama | Satoshi Sekine
Proceedings of the Human Language Technology Conference of the NAACL, Main Conference

pdf bib
Using Phrasal Patterns to Identify Discourse Relations
Manami Saito | Kazuhide Yamamoto | Satoshi Sekine
Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers

2005

pdf bib
A System to Solve Language Tests for Second Grade Students
Manami Saito | Kazuhide Yamamoto | Satoshi Sekine | Hitoshi Isahara
Companion Volume to the Proceedings of Conference including Posters/Demos and tutorial abstracts

pdf bib
Automatic Paraphrase Discovery based on Context and Keywords between NE Pairs
Satoshi Sekine
Proceedings of the Third International Workshop on Paraphrasing (IWP2005)

2004

pdf bib
Discovering Relations among Named Entities from Large Corpora
Takaaki Hasegawa | Satoshi Sekine | Ralph Grishman
Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04)

pdf bib
Multilingual Aligned Parallel Treebank Corpus Reflecting Contextual Information and Its Applications
Kiyotaka Uchimoto | Yujie Zhang | Kiyoshi Sudo | Masaki Murata | Satoshi Sekine | Hitoshi Isahara
Proceedings of the Workshop on Multilingual Linguistic Resources

pdf bib
Named Entity Discovery Using Comparable News Articles
Yusuke Shinyama | Satoshi Sekine
COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics

pdf bib
Cross-lingual Information Extraction System Evaluation
Kiyoshi Sudo | Satoshi Sekine | Ralph Grishman
COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics

pdf bib
Automatic Construction of Japanese KATAKANA Variant List from Large Corpus
Takeshi Masuyama | Satoshi Sekine | Hiroshi Nakagawa
COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics

pdf bib
Definition, Dictionaries and Tagger for Extended Named Entity Hierarchy
Satoshi Sekine | Chikashi Nobata
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

pdf bib
Automatic Extraction of Hyponyms from Japanese Newspapers Using Lexico-syntactic Patterns
Maya Ando | Satoshi Sekine | Shun Ishizaki
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

2003

pdf bib
pre-CODIE – Crosslingual On-Demand Information Extraction
Kiyoshi Sudo | Satoshi Sekine | Ralph Grishman
Companion Volume of the Proceedings of HLT-NAACL 2003 - Demonstrations

pdf bib
An Improved Extraction Pattern Representation Model for Automatic IE Pattern Acquisition
Kiyoshi Sudo | Satoshi Sekine | Ralph Grishman
Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics

pdf bib
Morphological Analysis of a Large Spontaneous Speech Corpus in Japanese
Kiyotaka Uchimoto | Chikashi Nobata | Atsushi Yamada | Satoshi Sekine | Hitoshi Isahara
Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics

pdf bib
A survey for Multi-Document Summarization
Satoshi Sekine | Chikashi Nobata
Proceedings of the HLT-NAACL 03 Text Summarization Workshop

pdf bib
Evaluation of Features for Sentence Extraction on Different Types of Corpora
Chikashi Nobata | Satoshi Sekine | Hitoshi Isahara
Proceedings of the ACL 2003 Workshop on Multilingual Summarization and Question Answering

pdf bib
Paraphrase Acquisition for Information Extraction
Yusuke Shinyama | Satoshi Sekine
Proceedings of the Second International Workshop on Paraphrasing

2002

pdf bib
Text Generation from Keywords
Kiyotaka Uchimoto | Satoshi Sekine | Hitoshi Isahara
COLING 2002: The 19th International Conference on Computational Linguistics

pdf bib
Morphological Analysis of the Spontaneous Speech Corpus
Kiyotaka Uchimoto | Chikashi Nobata | Atsushi Yamada | Satoshi Sekine | Hitoshi Isahara
COLING 2002: The 17th International Conference on Computational Linguistics: Project Notes

pdf bib
Summarization System Integrated with Named Entity Tagging and IE pattern Discovery
Chikashi Nobata | Satoshi Sekine | Hitoshi Isahara | Ralph Grishman
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)

pdf bib
Extended Named Entity Hierarchy
Satoshi Sekine | Kiyoshi Sudo | Chikashi Nobata
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)

2001

pdf bib
The Unknown Word Problem: a Morphological Analysis of Japanese Using Maximum Entropy Aided by a Dictionary
Kiyotaka Uchimoto | Satoshi Sekine | Hitoshi Isahara
Proceedings of the 2001 Conference on Empirical Methods in Natural Language Processing

pdf bib
Automatic Pattern Acquisition for Japanese Information Extraction
Kiyoshi Sudo | Satoshi Sekine | Ralph Grishman
Proceedings of the First International Conference on Human Language Technology Research

pdf bib
Word Translation Based on Machine Learning Models Using Translation Memory and Corpora
Kiyotaka Uchimoto | Satoshi Sekine | Masaki Murata | Hitoshi Isahara
Proceedings of SENSEVAL-2 Second International Workshop on Evaluating Word Sense Disambiguation Systems

2000

pdf bib
Dependency Model using Posterior Context
Kiyotaka Uchimoto | Masaki Murata | Satoshi Sekine | Hitoshi Isahara
Proceedings of the Sixth International Workshop on Parsing Technologies

We describe a new model for dependency structure analysis. This model learns the relationship between two phrasal units called bunsetsus as three categories: ‘between’, ‘dependent’, and ‘beyond’; it estimates the dependency likelihood by considering not only the relationship between two bunsetsus but also the relationship between the left bunsetsu and all of the bunsetsus to its right. We implemented this model based on the maximum entropy model. On the Kyoto University corpus, the dependency accuracy of our model was 88%, which is about 1% higher than that of the conventional model using exactly the same features.
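
On one plausible reading of this abstract, the likelihood that bunsetsu i depends on bunsetsu j multiplies the model's 'between' probability for candidates left of j, 'dependent' at j, and 'beyond' for candidates right of j. A sketch with a placeholder classifier (the uniform `prob` below stands in for the trained maximum entropy model):

    # Posterior-context dependency score, sketched: P(head of i is j)
    # combines three-way decisions over every bunsetsu to i's right.
    def prob(relation, i, k, sentence):
        return 1.0 / 3.0          # placeholder for P(relation | features)

    def head_likelihood(i, j, sentence):
        score = prob("dependent", i, j, sentence)
        for k in range(i + 1, len(sentence)):
            if k == j:
                continue
            score *= prob("between" if k < j else "beyond", i, k, sentence)
        return score

    sentence = ["B0", "B1", "B2", "B3"]      # bunsetsu placeholders
    i = 0
    best_head = max(range(i + 1, len(sentence)),
                    key=lambda j: head_likelihood(i, j, sentence))
    print(best_head)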

pdf bib
IREX: IR & IE Evaluation Project in Japanese
Satoshi Sekine | Hitoshi Isahara
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)

pdf bib
A Treebank of Spanish and its Application to Parsing
Antonio Moreno | Ralph Grishman | Susana López | Fernando Sánchez | Satoshi Sekine
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)

pdf bib
Difficulty Indices for the Named Entity Task in Japanese
Chikashi Nobata | Satoshi Sekine | Jun’ichi Tsujii
Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics

pdf bib
Backward Beam Search Algorithm for Dependency Analysis of Japanese
Satoshi Sekine | Kiyotaka Uchimoto | Hitoshi Isahara
COLING 2000 Volume 2: The 18th International Conference on Computational Linguistics

pdf bib
Japanese Dependency Analysis using a Deterministic Finite State Transducer
Satoshi Sekine
COLING 2000 Volume 2: The 18th International Conference on Computational Linguistics

pdf bib
Word Order Acquisition from Corpora
Kiyotaka Uchimoto | Masaki Murata | Qing Ma | Satoshi Sekine | Hitoshi Isahara
COLING 2000 Volume 2: The 18th International Conference on Computational Linguistics

pdf bib
Japanese Named Entity Extraction Evaluation - Analysis of Results -
Satoshi Sekine | Yoshio Eriguchi
COLING 2000 Volume 2: The 18th International Conference on Computational Linguistics

1999

pdf bib
Statistical Matching of Two Ontologies
Satoshi Sekine | Kiyoshi Sudo | Takano Ogino
SIGLEX99: Standardizing Lexical Resources

pdf bib
Japanese Dependency Structure Analysis Based on Maximum Entropy Models
Kiyotaka Uchimoto | Satoshi Sekine | Hitoshi Isahara
Ninth Conference of the European Chapter of the Association for Computational Linguistics

1998

pdf bib
Description of the Japanese NE System Used for MET-2
Satoshi Sekine
Seventh Message Understanding Conference (MUC-7): Proceedings of a Conference Held in Fairfax, Virginia, April 29 - May 1, 1998

pdf bib
Japanese IE System and Customization Tool
Chikashi Nobata | Satoshi Sekine | Roman Yangarber
TIPSTER TEXT PROGRAM PHASE III: Proceedings of a Workshop held at Baltimore, Maryland, October 13-15, 1998

pdf bib
A Decision Tree Method for Finding and Classifying Names in Japanese Texts
Satoshi Sekine | Ralph Grishman | Hiroyuki Shinnou
Sixth Workshop on Very Large Corpora

1997

pdf bib
The Domain Dependence of Parsing
Satoshi Sekine
Fifth Conference on Applied Natural Language Processing

1996

pdf bib
Modeling Topic Coherence for Speech Recognition
Satoshi Sekine
COLING 1996 Volume 2: The 16th International Conference on Computational Linguistics

1995

pdf bib
A Corpus-based Probabilistic Grammar with Only Two Non-terminals
Satoshi Sekine | Ralph Grishman
Proceedings of the Fourth International Workshop on Parsing Technologies

The availability of large, syntactically-bracketed corpora such as the Penn Tree Bank affords us the opportunity to automatically build or train broad-coverage grammars, and in particular to train probabilistic grammars. A number of recent parsing experiments have also indicated that grammars whose production probabilities are dependent on the context can be more effective than context-free grammars in selecting a correct parse. To make maximal use of context, we have automatically constructed, from the Penn Tree Bank version 2, a grammar in which the symbols S and NP are the only real nonterminals, and the other non-terminals or grammatical nodes are in effect embedded into the right-hand sides of the S and NP rules. For example, one of the rules extracted from the tree bank would be S -> NP VBX JJ CC VBX NP [1] (where NP is a non-terminal and the other symbols are terminals, i.e. part-of-speech tags of the Tree Bank). The most common structure in the Tree Bank associated with this expansion is (S NP (VP (VP VBX (ADJ JJ) CC (VP VBX NP)))) [2]. So if our parser uses rule [1] in parsing a sentence, it will generate structure [2] for the corresponding part of the sentence. Using 94% of the Penn Tree Bank for training, we extracted 32,296 distinct rules (23,386 for S and 8,910 for NP). We also built a smaller version of the grammar based on higher-frequency patterns for use as a back-up when the larger grammar is unable to produce a parse due to memory limitations. We applied this parser to 1,989 Wall Street Journal sentences (separate from the training set and with no limit on sentence length). Of the parsed sentences (1,899), the percentage of no-crossing sentences is 33.9%, and Parseval recall and precision are 73.43% and 72.61%.
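
The flattening described above is mechanical enough to sketch; the toy below treats trees as nested lists and keeps only S and NP as non-terminals (a full extractor would also recurse into the kept S/NP subtrees, and real input would be Penn Treebank bracketing):

    # Flatten a tree into a two-nonterminal rule: S and NP survive on the
    # right-hand side; every other node is spliced into its parent.
    def rhs_of(node, real={"S", "NP"}):
        out = []
        for child in node[1:]:
            if isinstance(child, str):        # POS-tag terminal
                out.append(child)
            elif child[0] in real:            # keep S/NP as non-terminals
                out.append(child[0])
            else:                             # splice this node's yield in
                out.extend(rhs_of(child, real))
        return out

    # (S NP (VP (VP VBX (ADJ JJ) CC (VP VBX NP)))) from the abstract:
    tree = ["S", ["NP", "NN"],
            ["VP", ["VP", "VBX", ["ADJ", "JJ"], "CC",
                    ["VP", "VBX", ["NP", "NN"]]]]]
    print((tree[0], rhs_of(tree)))
    # -> ('S', ['NP', 'VBX', 'JJ', 'CC', 'VBX', 'NP']), i.e. rule [1]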

1994

pdf bib
Automatic Sublanguage Identification for a New Text
Satoshi Sekine
Second Workshop on Very Large Corpora

A number of theoretical studies have been devoted to the notion of sublanguage, which mainly concerns linguistic phenomena restricted by the domain or context. Furthermore, there are some successful NLP systems which have explicitly or implicitly addressed sublanguage restrictions (e.g. TAUM-METEO, ATR). This suggests two objectives for future NLP research: 1) automatic linguistic knowledge acquisition for a sublanguage, and 2) automatic definition of sublanguages and identification of the right one for a new text. Both issues have become realistic owing to the appearance of large corpora. Despite the recent bloom of research on the first objective, there is little on the second. If this objective is achieved, NLP systems will be able to optimize for the sublanguage before processing the text, which will be a significant help in automatic processing. A preliminary experiment aimed at the second objective is described in this paper. It is conducted on about 3 MB of the Wall Street Journal corpus. We built article clusters (sublanguages) based on word appearance, and the closest article cluster among the set of clusters is chosen for each test article. The comparison between the new articles and the clusters shows the success of the sublanguage identification and the promising ability of the method. The result of an experiment using only the first two sentences of each article also indicates the feasibility of applying this method to speech recognition or other systems which cannot access the whole article prior to processing.
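
The core assignment step is simple enough to sketch: represent each cluster by word occurrence and pick the closest cluster for a new article by cosine similarity over bag-of-words counts (the clusters and text below are invented miniatures of the WSJ setup):

    # Nearest-cluster sublanguage identification over bag-of-words vectors.
    from collections import Counter
    import math

    def bow(text):
        return Counter(text.lower().split())

    def cosine(a, b):
        dot = sum(a[w] * b[w] for w in set(a) & set(b))
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    clusters = {
        "markets": bow("stocks bond prices rose trading shares market"),
        "earnings": bow("quarterly profit earnings revenue net income"),
    }
    article = "shares rose in heavy trading as the market rallied"
    print(max(clusters, key=lambda c: cosine(clusters[c], bow(article))))
    # -> "markets"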

1992

pdf bib
Automatic Learning for Semantic Collocation
Satoshi Sekine | Jeremy J. Carroll | Sofia Ananiadou | Jun’ichi Tsujii
Third Conference on Applied Natural Language Processing

pdf bib
Linguistic Knowledge Generator
Satoshi Sekine | Sofia Ananiadou | Jeremy J. Carroll | Jun’ichi Tsujii
COLING 1992 Volume 2: The 14th International Conference on Computational Linguistics
