Shin’Ichi Satoh

Also published as: Shin’ichi Satoh


2024

Utiliser l’explicabilité des modèles pour mettre en évidence les expressions genrées dans la parole
François Buet | Camille Guinaudeau | Cyril Grouin | Sahar Ghannay | Shin’Ichi Satoh
Actes de la 31ème Conférence sur le Traitement Automatique des Langues Naturelles, volume 1 : articles longs et prises de position

In many countries, studies have highlighted the under-representation of women in the media. But beyond this quantitative imbalance lies the question of the qualitative asymmetry between the representations of men and women. How can we automate the assessment of the content and salient traits specific to male and female speech? In this study, we propose to exploit the knowledge acquired by a classification model trained for gender detection on automatic transcriptions, in order to highlight distinctive patterns of male or female speech. Our approach relies on methods developed for explainable artificial intelligence (XAI) to compute attribution scores at the unit level.
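
A minimal sketch of the kind of unit-level attribution described above, assuming a generic HuggingFace classifier and a simple occlusion-based scoring scheme; the checkpoint name is a placeholder and this is not necessarily the XAI method used in the paper:

```python
# Illustrative occlusion-based attribution over transcript tokens:
# score each word by the drop in predicted-class probability when it is masked.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL = "camembert-base"  # placeholder for a gender classifier fine-tuned on ASR transcripts
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2).eval()

def class_prob(text: str, label: int) -> float:
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, label].item()

def occlusion_attributions(text: str, label: int):
    """Attribution score per word: base probability minus probability with the word masked."""
    words = text.split()
    base = class_prob(text, label)
    scores = []
    for i in range(len(words)):
        occluded = " ".join(words[:i] + [tokenizer.mask_token] + words[i + 1:])
        scores.append((words[i], base - class_prob(occluded, label)))
    return scores

print(occlusion_attributions("merci beaucoup madame la ministre", label=1))
```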

Vers une pédagogie inclusive : une classification multimodale des illustrations de manuels scolaires pour des environnements d’apprentissage adaptés
Saumya Yadav | Élise Lincker | Caroline Huron | Stéphanie Martin | Camille Guinaudeau | Shin’Ichi Satoh | Jainendra Shukla
Actes de la 31ème Conférence sur le Traitement Automatique des Langues Naturelles, volume 1 : articles longs et prises de position

To promote inclusive education, automatic systems capable of adapting textbooks to make them accessible to children with disabilities are needed. In this context, we propose to classify the images associated with exercises into three classes (Essential, Informative, and Useless) in order to decide whether or not to include them in the accessible version of the textbook for visually impaired children. On a dataset of 652 (text, image) pairs, we use state-of-the-art monomodal and multimodal approaches and show that text-based approaches achieve the best results. The CamemBERT model reaches an accuracy of 85.25% when combined with strategies for handling imbalanced data. To better understand the relationship between text and image in textbook exercises, we also carry out a qualitative analysis of the results obtained with and without the image modality and use the LIME method to explain the decisions of our models.
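
A hedged sketch of the text-only setup reported above: fine-tuning CamemBERT on three classes with a class-weighted loss as one simple way of handling imbalance. The label names, class counts, and weighting scheme are assumptions, not necessarily the strategies used in the paper:

```python
# Illustrative: CamemBERT sequence classification with class-weighted cross-entropy
# to counter imbalance among the three classes (Essential, Informative, Useless).
import torch
from torch import nn
from transformers import AutoTokenizer, AutoModelForSequenceClassification

LABELS = ["essential", "informative", "useless"]  # assumed label names
tokenizer = AutoTokenizer.from_pretrained("camembert-base")
model = AutoModelForSequenceClassification.from_pretrained("camembert-base", num_labels=len(LABELS))

# Weights inversely proportional to (hypothetical) class frequencies.
class_counts = torch.tensor([300.0, 250.0, 102.0])            # placeholder counts summing to 652
weights = class_counts.sum() / (len(LABELS) * class_counts)
criterion = nn.CrossEntropyLoss(weight=weights)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

def training_step(texts, labels):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    logits = model(**batch).logits
    loss = criterion(logits, torch.tensor(labels))
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

print(training_step(["Colorie les lettres de l'alphabet."], [0]))
```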

2023

Referring Image Segmentation via Joint Mask Contextual Embedding Learning and Progressive Alignment Network
Ziling Huang | Shin’ichi Satoh
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Referring image segmentation is a task that aims to predict pixel-wise masks corresponding to objects in an image described by natural language expressions. Previous methods for referring image segmentation employ a cascade framework to break down complex problems into multiple stages. However, this framework has clear drawbacks: existing methods within it may struggle both to maintain a strong focus on the most relevant information during specific stages of the referring image segmentation process and to rectify errors propagated from early stages, which can ultimately result in sub-optimal performance. To address these limitations, we propose the Joint Mask Contextual Embedding Learning Network (JMCELN). JMCELN is designed to enhance the cascade framework by incorporating a Learnable Contextual Embedding and a Progressive Alignment Network (PAN). The Learnable Contextual Embedding module dynamically stores and utilizes reasoning information based on the current mask prediction results, enabling the network to adaptively capture and refine pertinent information for improved mask prediction accuracy. Furthermore, the Progressive Alignment Network (PAN) is introduced as an integral part of JMCELN. PAN leverages the output from the previous layer as a filter for the current output, effectively reducing inconsistencies between predictions from different stages. By iteratively aligning the predictions, PAN guides the Learnable Contextual Embedding to incorporate more discriminative information for reasoning, leading to enhanced prediction quality and a reduction in error propagation. With these methods, we achieve state-of-the-art results on three commonly used benchmarks, especially on the more intricate datasets. The code will be released.
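
A rough PyTorch sketch of the two ideas named above: a mask-conditioned contextual embedding and a progressive alignment step that filters the current stage's prediction with the previous one. Tensor shapes, module choices, and the gating form are assumptions for illustration, not the released implementation:

```python
# Sketch of one cascade stage: the previous mask updates a contextual embedding,
# and a progressive-alignment gate filters the new prediction with the previous one.
import torch
from torch import nn

class CascadeStage(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.context = nn.Parameter(torch.zeros(1, dim))   # learnable contextual embedding
        self.update = nn.GRUCell(dim, dim)                  # refresh context from mask-pooled features
        self.head = nn.Conv2d(dim * 2, 1, kernel_size=1)    # mask predictor

    def forward(self, feats, lang, prev_mask):
        # feats: (B, C, H, W) visual features, lang: (B, C) language embedding,
        # prev_mask: (B, 1, H, W) previous-stage prediction in [0, 1]
        b, c, h, w = feats.shape
        pooled = (feats * prev_mask).flatten(2).mean(-1)        # mask-weighted pooling
        ctx = self.update(pooled, self.context.expand(b, -1))   # contextual embedding update
        cond = (lang + ctx).view(b, c, 1, 1).expand(-1, -1, h, w)
        logits = self.head(torch.cat([feats, cond], dim=1))
        # Progressive alignment (assumed form): the previous mask softly filters the new prediction.
        return torch.sigmoid(logits) * (0.5 + 0.5 * prev_mask)

stage = CascadeStage()
out = stage(torch.randn(2, 256, 32, 32), torch.randn(2, 256), torch.rand(2, 1, 32, 32))
print(out.shape)  # torch.Size([2, 1, 32, 32])
```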

2018

Discriminative Learning of Open-Vocabulary Object Retrieval and Localization by Negative Phrase Augmentation
Ryota Hinami | Shin’ichi Satoh
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Thanks to the success of object detection technology, we can retrieve objects of specified classes even from huge image collections. However, current state-of-the-art object detectors (such as Faster R-CNN) can only handle pre-specified classes, and large amounts of positive and negative visual samples are required for training. In this paper, we address the problem of open-vocabulary object retrieval and localization, where the target object is specified by a textual query (e.g., a word or phrase). We first propose Query-Adaptive R-CNN, a simple extension of Faster R-CNN adapted to open-vocabulary queries, which transforms the text embedding vector into an object classifier and a localization regressor. Then, for discriminative training, we propose negative phrase augmentation (NPA) to mine hard negative samples that are visually similar to the query and at the same time semantically mutually exclusive with it. The proposed method can retrieve and localize objects specified by a textual query from one million images in only 0.5 seconds with high precision.
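
A minimal sketch of the core idea above: a query's text embedding is transformed into per-query detection weights (a classifier and a box regressor) that are applied to region features. The projection layers and dimensions are assumptions rather than the paper's exact architecture; the hard negatives mined by NPA would be phrases semantically exclusive of the query, which is not shown here:

```python
# Sketch: turn a query's text embedding into an object classifier and a box regressor,
# then score candidate region features (e.g., RoI-pooled features from a detector).
import torch
from torch import nn

class QueryAdaptiveHead(nn.Module):
    def __init__(self, text_dim: int = 300, roi_dim: int = 1024):
        super().__init__()
        self.to_cls = nn.Linear(text_dim, roi_dim)       # text -> classifier weights
        self.to_reg = nn.Linear(text_dim, roi_dim * 4)   # text -> box-regressor weights

    def forward(self, text_emb, roi_feats):
        # text_emb: (text_dim,), roi_feats: (num_rois, roi_dim)
        w_cls = self.to_cls(text_emb)                     # (roi_dim,)
        w_reg = self.to_reg(text_emb).view(4, -1)         # (4, roi_dim)
        scores = roi_feats @ w_cls                        # relevance of each region to the query
        deltas = roi_feats @ w_reg.t()                    # per-region box refinement (dx, dy, dw, dh)
        return scores, deltas

head = QueryAdaptiveHead()
scores, deltas = head(torch.randn(300), torch.randn(100, 1024))
print(scores.shape, deltas.shape)  # torch.Size([100]) torch.Size([100, 4])
```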

2016

Video Event Detection by Exploiting Word Dependencies from Image Captions
Sang Phan | Yusuke Miyao | Duy-Dinh Le | Shin’ichi Satoh
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Video event detection is a challenging problem in information and multimedia retrieval. Unlike single-action detection, event detection requires a richer level of semantic information from video. To overcome this challenge, existing solutions often represent videos using high-level features such as concepts. However, concept-based representation can be confusing because it does not encode the relationships between concepts. This issue can be addressed by exploiting co-occurrences of concepts, but doing so often leads to a huge number of possible combinations. In this paper, we propose a new approach that obtains the relationships between concepts by exploiting the syntactic dependencies between words in image captions. The main advantage of this approach is that it significantly reduces the number of combinations between concepts while keeping the informative ones. We conduct extensive experiments to analyze the effectiveness of the new dependency representation for event detection on the two large-scale TRECVID Multimedia Event Detection 2013 and 2014 datasets. Experimental results show that (i) dependency features are more discriminative than concept-based features, and (ii) dependency features can be combined with our current event detection system to further improve performance; for instance, the relative improvement can be as large as 8.6% on the MEDTEST14 10Ex setting.
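
A short sketch of how dependency pairs could be extracted from captions and turned into a bag-of-dependencies feature, using spaCy for parsing; the triple vocabulary and filtering are an illustrative simplification of the paper's pipeline, not its actual implementation:

```python
# Illustrative: extract (head_lemma, relation, child_lemma) triples from image captions
# and count them as a sparse "bag of dependencies" feature for a video.
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")

def dependency_features(captions):
    counts = Counter()
    for doc in nlp.pipe(captions):
        for tok in doc:
            if tok.dep_ not in ("punct", "ROOT"):
                counts[(tok.head.lemma_, tok.dep_, tok.lemma_)] += 1
    return counts

feats = dependency_features([
    "a man is riding a bicycle on the street",
    "a dog catches a frisbee in the park",
])
print(feats.most_common(5))
```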