2024
Stage Direction Classification in French Theater: Transfer Learning Experiments
Alexia Schneider | Pablo Ruiz Fabo
Proceedings of the 8th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL 2024)
The automatic classification of stage directions is a little-explored topic in computational drama analysis, in spite of their relevance for plays’ structural and stylistic analysis. With a view to assessing good practices for the automatic annotation of this textual element, we developed a 13-class stage direction typology, based on annotations in the FreDraCor corpus (French-language plays), but abstracting away from their huge variability while still providing classes useful for literary research. We fine-tuned transformer-based models to classify against the typology, gradually decreasing the corpus size used for fine-tuning, to compare model efficiency with reduced training data. A comparison of results speaks in favour of distilled monolingual models for this task and, unlike earlier research on German, shows no negative effects of model case-sensitivity. The results have practical relevance for computational literary studies: comparing classification results across complementary stage direction typologies, while limiting the amount of manual annotation needed to apply them, would support a systematic study of this important textual element.
Managing Fine-grained Metadata for Text Bases in Extremely Low Resource Languages: The Cases of Two Regional Languages of France
Marianne Vergez-Couret | Delphine Bernhard | Michael Nauge | Myriam Bras | Pablo Ruiz Fabo | Carole Werner
Proceedings of the 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages @ LREC-COLING 2024
Metadata are key components of language resources and facilitate their exploitation and re-use. Their creation is a labour-intensive process and requires a modeling step, which identifies resource-specific information as well as standards and controlled vocabularies that can be reused. In this article, we focus on metadata for documenting text bases for regional languages of France characterised by several levels of variation (space, time, usage, social status), based on a survey of existing metadata schemas. Moreover, we implement our metadata model as a database structure for the Heurist data management system, which combines the ease of use of spreadsheets with the ability of relational databases to model complex relationships between entities. The Heurist template is made freely available and was used to describe metadata for text bases in Alsatian and Poitevin-Saintongeais. We also propose tools to automatically generate XML metadata header files from the database.
2022
ELAL: An Emotion Lexicon for the Analysis of Alsatian Theatre Plays
Delphine Bernhard | Pablo Ruiz Fabo
Proceedings of the Thirteenth Language Resources and Evaluation Conference
In this work, we present a novel and manually corrected emotion lexicon for the Alsatian dialects, including graphical variants of Alsatian lexical items. These High German dialects are spoken in the North-East of France. They are used mainly orally, and thus lack a stable and consensual spelling convention. There has nevertheless been continuous literary production, in particular of theatre plays, since the middle of the 17th century. A large sample of Alsatian theatre plays is currently being encoded according to the Text Encoding Initiative (TEI) Guidelines. The emotion lexicon will be used to perform automatic emotion analysis in this corpus of theatre plays. We used a graph-based approach to derive emotion scores and translations, relying only on bilingual lexicons, cognates and spelling variants. The source lexicons for emotion scores are the NRC Valence, Arousal and Dominance lexicon and the NRC Emotion Intensity lexicon.
2017
Enjambment Detection in a Large Diachronic Corpus of Spanish Sonnets
Pablo Ruiz Fabo | Clara Martínez Cantón | Thierry Poibeau | Elena González-Blanco
Proceedings of the Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature
Enjambment takes place when a syntactic unit is broken up across two lines of poetry, giving rise to different stylistic effects. In Spanish literary studies, there are unclear points about the types of stylistic effects that can arise, and under which linguistic conditions. To systematically gather evidence about this, we developed a system to automatically identify enjambment (and its type) in Spanish. For evaluation, we manually annotated a reference corpus covering different periods. As a scholarly corpus on which to apply the tool, we created, from public HTML sources, a diachronic corpus covering four centuries of sonnets (3750 poems), and we analyzed the occurrence of enjambment across stanzaic boundaries in different periods. We also found examples that highlight limitations in current definitions of enjambment.
2016
More than Word Cooccurrence: Exploring Support and Opposition in International Climate Negotiations with Semantic Parsing
Pablo Ruiz Fabo | Clément Plancq | Thierry Poibeau
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
Text analysis methods widely used in digital humanities often involve word co-occurrence, e.g. concept co-occurrence networks. These methods provide a useful corpus overview, but cannot determine the predicates that relate co-occurring concepts. Our goal was to identify propositions expressing the points supported or opposed by participants in international climate negotiations. Word co-occurrence methods were not sufficient, and an analysis based on open relation extraction had limited coverage for nominal predicates. We present a pipeline which identifies the points that different actors support and oppose, via a domain model with support/opposition predicates, and analysis rules that exploit the output of semantic role labelling, syntactic dependencies and anaphora resolution. Entity linking and keyphrase extraction are also performed on the propositions related to each actor. A user interface allows examining the main concepts in points supported or opposed by each participant, which participants agree or disagree with each other, and about which issues. The system is an example of the tools that digital humanities scholars are asking for, to render rich textual information (beyond word co-occurrence) more amenable to quantitative treatment. An evaluation of the tool gave satisfactory results.
2015
Combining Open Source Annotators for Entity Linking through Weighted Voting
Pablo Ruiz | Thierry Poibeau
Proceedings of the Fourth Joint Conference on Lexical and Computational Semantics
EL92: Entity Linking Combining Open Source Annotators via Weighted Voting
Pablo Ruiz | Thierry Poibeau
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)
ELCO3: Entity Linking with Corpus Coherence Combining Open Source Annotators
Pablo Ruiz | Thierry Poibeau | Frédérique Mélanie
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations
2014
Phoneme Similarity Matrices to Improve Long Audio Alignment for Automatic Subtitling
Pablo Ruiz | Aitor Álvarez | Haritz Arzelus
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Long audio alignment systems for Spanish and English are presented, within an automatic subtitling application. Language-specific phone decoders automatically recognize audio contents at phoneme level. At the same time, language-dependent grapheme-to-phoneme modules perform a transcription of the script for the audio. A dynamic programming algorithm (Hirschberg’s algorithm) finds matches between the phonemes automatically recognized by the phone decoder and the phonemes in the script’s transcription. Alignment accuracy is evaluated in two settings: scoring alignment operations with a baseline binary matrix, and scoring them with several continuous-score matrices based on phoneme similarity, as assessed by comparing multivalued phonological features. Alignment accuracy results are reported at phoneme, word and subtitle level. Alignment accuracy when using the continuous scoring matrices based on phonological similarity was clearly higher than when using the baseline binary matrix.