Conference on Computational Natural Language Learning (2024)

Volumes

Proceedings of the 28th Conference on Computational Natural Language Learning 41 papers
The 2nd BabyLM Challenge at the 28th Conference on Computational Natural Language Learning 30 papers

pdf (full)
bib (full) Proceedings of the 28th Conference on Computational Natural Language Learning

Proceedings of the 28th Conference on Computational Natural Language Learning
Libby Barak | Malihe Alikhani

Words That Stick: Using Keyword Cohesion to Improve Text Segmentation
Amit Maraj | Miguel Vargas Martin | Masoud Makrehchi

Text Segmentation (TS) is the idea of segmenting bodies of text into coherent blocks, mostly defined by the topics each segment contains. Historically, techniques in this area have been unsupervised, with more success recently coming from supervised methods instead. Although these approaches see better performance, they require training data and upfront training time. We propose a new method called Coherence, where we use strong sentence embeddings to pull representational keywords as the main constructor of sentences when comparing them to one another. Additionally, we include a storage of previously found keywords for the purposes of creating a more accurate segment representation instead of just the immediate sentence in question. With our system, we show improved results over current state-of-the-art unsupervised techniques when analyzed using Pk and WindowDiff scores. Because its unsupervised, Coherence requires no fine-tuning.

pdf bib abs

Investigating large language models for their competence in extracting grammatically sound sentences from transcribed noisy utterances
Alina Wróblewska

Selectively processing noisy utterances while effectively disregarding speech-specific elements poses no considerable challenge for humans, as they exhibit remarkable cognitive abilities to separate semantically significant content from speech-specific noise (i.e. filled pauses, disfluencies, and restarts). These abilities may be driven by mechanisms based on acquired grammatical rules that compose abstract syntactic-semantic structures within utterances. Segments without syntactic and semantic significance are consistently disregarded in these structures. The structures, in tandem with lexis, likely underpin language comprehension and thus facilitate effective communication.In our study, grounded in linguistically motivated experiments, we investigate whether large language models (LLMs) can effectively perform analogical speech comprehension tasks. In particular, we examine the ability of LLMs to extract well-structured utterances from transcriptions of noisy dialogues. We conduct two evaluation experiments in the Polish language scenario, using a dataset presumably unfamiliar to LLMs to mitigate the risk of data contamination. Our results show that not all extracted utterances are correctly structured, indicating that either LLMs do not fully acquire syntactic-semantic rules or they acquire them but cannot apply them effectively. We conclude that the ability of LLMs to comprehend noisy utterances is still relatively superficial compared to human proficiency in processing them.

pdf bib abs

Sociocultural norms serve as guiding principles for personal conduct in social interactions within a particular society or culture. The study of norm discovery has seen significant development over the last few years, with various interesting approaches. However, it is difficult to adopt these approaches to discover norms in a new culture, as they rely either on human annotations or real-world dialogue contents. This paper presents a robust automatic norm discovery pipeline, which utilizes the cultural knowledge of GPT-3.5 Turbo (ChatGPT) along with several social factors. By using these social factors and ChatGPT, our pipeline avoids the use of human dialogues that tend to be limited to specific scenarios, as well as the use of human annotations that make it difficult and costly to enlarge the dataset. The resulting database - Multi-cultural Norm Base (MNB) - covers 6 distinct cultures, with over 150k sociocultural norm statements in total. A state-of-the-art Large Language Model (LLM), Llama 3, fine-tuned with our proposed dataset, shows remarkable results on various downstream tasks, outperforming models fine-tuned on other datasets significantly.

pdf bib abs

Lossy Context Surprisal Predicts Task-Dependent Patterns in Relative Clause Processing
Kate McCurdy | Michael Hahn

English relative clauses are a critical test case for theories of syntactic processing. Expectation- and memory-based accounts make opposing predictions, and behavioral experiments have found mixed results. We present a technical extension of Lossy Context Surprisal (LCS) and use it to model relative clause processing in three behavioral experiments. LCS predicts key results at distinct retention rates, showing that task-dependent memory demands can account for discrepant behavioral patterns in the literature.

pdf bib abs

Global-Pruner: A Stable and Efficient Pruner for Retraining-Free Pruning of Encoder-Based Language Models
Guangzhen Yao | Yuehan Wang | Hui Xu | Long Zhang | Miao Qi

Large language models (LLMs) have achieved significant success in complex tasks across various domains, but they come with high computational costs and inference latency issues. Pruning, as an effective method, can significantly reduce inference costs. However, current pruning algorithms for encoder-based language models often focus on locally optimal solutions, neglecting a comprehensive exploration of the global solution space. This oversight can lead to instability in the solution process, thereby affecting the overall performance of the model. To address these challenges, we propose a structured pruning algorithm named G-Pruner (Global Pruner), comprising two integral components: PPOM (Proximal Policy Optimization Mask) and CG²MT (Conjugate Gradient Squared Mask Tuning), utilizing a global optimization strategy. This strategy not only eliminates the need for retraining but also ensures the algorithm’s stability and adaptability to environmental changes, effectively addressing the issue of focusing solely on immediate optima while neglecting long-term effects. This method is evaluated on the GLUE and SQuAD benchmarks using BERTBASE and DistilBERT models. The experimental results indicate that without any retraining, G-Pruner achieves significant accuracy improvements on the SQuAD_2.0 task with a FLOPs constraint of 60%, demonstrating a 6.02% increase in F1 score compared with baseline algorithms.

pdf bib abs

Transformer verbatim in-context retrieval across time and scale
Kristijan Armeni | Marko Pranjić | Senja Pollak

To predict upcoming text, language models must in some cases retrieve in-context information verbatim. In this report, we investigated how the ability of language models to retrieve arbitrary in-context nouns developed during training (across time) and as language models trained on the same dataset increase in size (across scale). We then asked whether learning of in-context retrieval correlates with learning of more challenging zero-shot benchmarks. Furthermore, inspired by semantic effects in human short-term memory, we evaluated the retrieval with respect to a major semantic component of target nouns, namely whether they denote a concrete or abstract entity, as rated by humans. We show that verbatim in-context retrieval developed in a sudden transition early in the training process, after about 1% of the training tokens. This was observed across model sizes (from 14M and up to 12B parameters), and the transition occurred slightly later for the two smallest models. We further found that the development of verbatim in-context retrieval is positively correlated with the learning of zero-shot benchmarks. Around the transition point, all models showed the advantage of retrieving concrete nouns as opposed to abstract nouns. In all but two smallest models, the advantage dissipated away toward the end of training.

pdf bib abs

Evaluation of text generation to date has primarily focused on content created sequentially, rather than improvements on a piece of text. Writing, however, is naturally an iterative and incremental process that requires expertise in different modular skills such as fixing outdated information or making the writing style more consistent. Even so, comprehensive evaluation of a model’s capacity to perform these skills and the ability to edit remains sparse. This work introduces EditEval: An instruction-based, benchmark and evaluation suite that leverages high-quality existing and new datasets in English for the automatic evaluation of editing capabilities, such as making text more cohesive and paraphrasing. We evaluate several pre-trained models, which shows that InstructGPT and PEER on average perform the best, but that most baselines fall below the supervised state-of-the-art, particularly when neutralizing and updating information. Our analysis also shows that commonly used metrics for editing tasks do not always correlate well, and that prompts leading to the strongest performance do not necessarily elicit strong performance across different models. Through the release of this benchmark (code and data available at https://github.com/facebookresearch/EditEval) and a publicly available leaderboard challenge, we hope to unlock future work on developing models more capable of controllable and iterative editing.

pdf bib abs

Language Models (LMs) excel in natural language processing tasks for English but show reduced performance in most other languages. This problem is commonly tackled by continually pre-training and fine-tuning these models for said languages. A significant issue in this process is the limited vocabulary coverage in the original model’s tokenizer, leading to inadequate representation of new languages and necessitating an expansion of the tokenizer. The initialization of the embeddings corresponding to new vocabulary items presents a further challenge. Current strategies require cross-lingual embeddings and lack a solid theoretical foundation as well as comparisons with strong baselines. In this paper, we first establish theoretically that initializing within the convex hull of existing embeddings is a good initialization, followed by a novel but simple approach, Constrained Word2Vec (CW2V), which does not require cross-lingual embeddings. Our study evaluates different initialization methods for expanding RoBERTa and LLaMA 2 across four languages and five tasks. The results show that CW2V performs equally well or even better than more advanced techniques. Additionally, simpler approaches like multivariate initialization perform on par with these advanced methods indicating that efficient large-scale multilingual continued pretraining can be achieved even with simpler initialization methods.

pdf bib abs

Critical Questions Generation: Motivation and Challenges
Blanca Calvo Figueras | Rodrigo Agerri

The development of Large Language Models (LLMs) has brought impressive performances on mitigation strategies against misinformation, such as counterargument generation. However, LLMs are still seriously hindered by outdated knowledge and by their tendency to generate hallucinated content. In order to circumvent these issues, we propose a new task, namely, Critical Questions Generation, consisting of processing an argumentative text to generate the critical questions (CQs) raised by it.In argumentation theory CQs are tools designed to lay bare the blind spots of an argument by pointing at the information it could be missing.Thus, instead of trying to deploy LLMs to produce knowledgeable and relevant counterarguments, we use them to question arguments, without requiring any external knowledge.Research on CQs Generation using LLMs requires a reference dataset for large scale experimentation. Thus, in this work we investigate two complementary methods to create such a resource: (i) instantiating CQs templates as defined by Walton’s argumentation theory and (ii), using LLMs as CQs generators. By doing so, we contribute with a procedure to establish what is a valid CQ and conclude that, while LLMs are reasonable CQ generators, they still have a wide margin for improvement in this task.

pdf bib abs

Information Association for Language Model Updating by Mitigating LM-Logical Discrepancy
Pengfei Yu | Heng Ji

Large Language Models (LLMs) struggle with providing current information due to the outdated pre-training data. Existing methods for updating LLMs, such as knowledge editing and continual fine-tuning, have significant drawbacks in generalizability of new information and the requirements on structured updating corpus. We identify the core challenge behind these drawbacks: the LM-logical discrepancy featuring the difference between language modeling probabilities and logical probabilities. To evaluate and address the core challenge, we propose a new task formulation of the information updating task that only requires the provision of an unstructured updating corpus and evaluates the performance of information updating on the generalizability to question-answer pairs pertaining to the updating information.We further propose a novel and effective pipeline approach for the task, highlighting a self-prompting-based question-answer generation process and a associative distillation methods to bridge the LM-logical discrepancy.We develop two datasets for evaluation, one sourced from news articles published in March and April 2023, and the other from the Natural Questions benchmark.Experimental results demonstrate the superiority of our approach, significantly increasing the factual consistency score (on a scale from 0 to 1) by up to 0.16. Furthermore, our method effectively mitigates forgetting utilizing a compact replay buffer with only 2.3% of the training tokens.

pdf bib abs

Causal ATE Mitigates Unintended Bias in Controlled Text Generation
Rahul Madhavan | Kahini Wadhawan

We study attribute control in language models through the method of Causal Average Treatment Effect (Causal ATE). Existing methodsfor the attribute control task in Language Models(LMs) check for the co-occurrence of words in a sentence with the attribute of interest, and control for them. However, spurious correlation of the words with the attribute in the training dataset, can cause models to hallucinate the presence of the attribute when presented with the spurious correlate during inference. We show that the simple perturbation-based method of Causal ATE removes this unintended effect. Specifically, we ground it in the problem of toxicity mitigation, where a significant challenge lies in the inadvertent bias that often emerges towards protected groups post detoxification. We show that this unintended bias can be solved by the use of the Causal ATE metric. We provide experimental validations for our claims and release our code (anonymously) here: [github.com/causalate-mitigates-bias](https://github.com/causalate-mitigates-bias/causal-ate-mitigates-bias).

pdf bib abs

On Functional Competence of LLMs for Linguistic Disambiguation
Raihan Kibria | Sheikh Intiser Uddin Dipta | Muhammad Abdullah Adnan

We study some Large Language Models to explore their deficiencies in resolving sense ambiguities. In this connection, we evaluate their performance on well-known word sense disambiguation datasets. Word Sense Disambiguation (WSD) has been a long-standing NLP problem, which has given rise to many evaluation datasets and models over the decades. Recently the emergence of Large Language Models (LLM) raises much hope in improving accuracy. In this work, we evaluate word sense disambiguation capabilities of four LLMs: OpenAI’s ChatGPT-3.5, Mistral’s 7b parameter model, Meta’s Llama 70b, and Google’s Gemini Pro. We evaluate many well-established datasets containing a variety of texts and senses on these. After observing the performances of some datasets, we selectively study some failure cases and identify the reasons for failures. We explore human judgments that would correct these failures. Our findings suggest that many failure cases are related to a lack of world knowledge and the reasoning to amalgamate this knowledge rather than the lack of linguistic knowledge. We categorize the judgments so that the next generation of LLMs can improve by incorporating deeper world knowledge and reasoning. We conclude that word sense disambiguation could serve as a guide for probing the reasoning power of LLMs to measure their functional competency. We also list the accuracy of these datasets. We find that on many occasions, accuracy drops to below 70%, which is much less than that of well-performing existing models.

pdf bib abs

AIStorySimilarity: Quantifying Story Similarity Using Narrative for Search, IP Infringement, and Guided Creativity
Jon Chun

Stories are central for interpreting experiences, communicating, and influencing each other via films, medical, media, and other narratives. Quantifying the similarity between stories has numerous applications including detecting IP infringement, detecting hallucinations, search/recommendation engines, and guiding human-AI collaborations. Despite this, traditional NLP text similarity metrics are limited to short text distance metrics like n-gram overlaps and embeddings. Larger texts require preprocessing with significant information loss through paraphrasing or multi-step decomposition. This paper introduces AIStorySimilarity, a novel benchmark to measure the semantic distance between long-text stories based on core structural elements drawn from narrative theory and script writing. Based on four narrative elements (characters, plot, setting, and themes) as well 31 sub-features within these, we use a SOTA LLM (gpt-3.5-turbo) to extract and evaluate the semantic similarity of a diverse set of major Hollywood movies. In addition, we compare human evaluation with story similarity scores computed three ways: extracting elements from film scripts before evaluation (Elements), directly evaluating entire scripts (Scripts), and extracting narrative elements from the parametric memory of SOTA LLMs without any provided scripts (GenAI). To the best of our knowledge, AIStorySimilarity is the first benchmark to measure long-text story similarity using a comprehensive approach to narrative theory. Code and data are available at https://github.com/jon-chun/AIStorySimiliarity.

pdf bib abs

SPAWNing Structural Priming Predictions from a Cognitively Motivated Parser
Grusha Prasad | Tal Linzen

Structural priming is a widely used psycholinguistic paradigm to study human sentence representations. In this work we introduce SPAWN, a cognitively motivated parser that can generate quantitative priming predictions from contemporary theories in syntax which assume a lexicalized grammar. By generating and testing priming predictions from competing theoretical accounts, we can infer which assumptions from syntactic theory are useful for characterizing the representations humans build when processing sentences. As a case study, we use SPAWN to generate priming predictions from two theories (Whiz-Deletion and Participial-Phase) which make different assumptions about the structure of English relative clauses. By modulating the reanalysis mechanism that the parser uses and strength of the parser’s prior knowledge, we generated nine sets of predictions from each of the two theories. Then, we tested these predictions using a novel web-based comprehension-to-production priming paradigm. We found that while the some of the predictions from the Participial-Phase theory aligned with human behavior, none of the predictions from the the Whiz-Deletion theory did, thus suggesting that the Participial-Phase theory might better characterize human relative clause representations.

pdf bib abs

Global Learning with Triplet Relations in Abstractive Summarization
Fengyu Lu | Jiaxin Duan | Junfei Liu

Abstractive summarization models learned with token-level maximum likelihood estimation suffer from exposure bias, that the condition for predicting the next token is discrepant during training and inference. Existing solutions bridge this gap by learning to estimate semantic or lexical qualities of a candidate summary from the global view, namely global learning (GL), yet ignore maintaining rational triplet-relations among document, reference summary, and candidate summaries, e.g., the candidate and reference summaries should have a similar faithfulness degree judging by a source document. In this paper, we propose an iterative autoregressive summarization paradigm - IARSum, which fuses the learning of triplet relations into a GL framework and further enhances summarization performance. Specifically, IARSum develops a dual-encoder network to enable the simultaneous input of a document and its candidate (or reference) summary. On this basis, it learns to 1) model the relative semantics defined over tuples (candidate, document) and (reference, document) respectively and balance them; 2) reduce lexical differences between candidate and reference summaries. Furthermore, IARSum iteratively reprocesses a generated candidate at inference time to ground higher quality. We conduct extensive experiments on two widely used datasets to test our method, and IARSum shows the new or matched state-of-the-art on diverse metrics.

pdf bib abs

TpT-ADE: Transformer Based Two-Phase ADE Extraction
Suryamukhi Kuchibhotla | Manish Singh

Extracting adverse reactions to medications or treatments is a crucial activity in the biomedical domain. The task involves identifying mentions of drugs and their adverse effects/events in raw text, which is challenging due to the unstructured nature of clinical narratives. In this paper, we propose TpT-ADE, a novel joint two-phase transformer model combined with natural language processing (NLP) techniques, to identify adverse events (AEs) caused by drugs. In the first phase of TpT-ADE, entities are extracted and are grounded with their standard terms using the Unified Medical Language System (UMLS) knowledge base. In the second phase, entity and relation classification is performed to determine the presence of a relationship between the drug and AE pairs. TpT-ADE also identifies the intensity of AE entities by constructing a parts-of-speech (POS) embedding model. Unlike previous approaches that use complex classifiers, TpT-ADE employs a shallow neural network and yet outperforms the state-of-the-art methods on the standard ADE corpus.

pdf bib abs

The Effect of Surprisal on Reading Times in Information Seeking and Repeated Reading
Keren Gruteke Klein | Yoav Meiri | Omer Shubi | Yevgeni Berzak

The effect of surprisal on processing difficulty has been a central topic of investigation in psycholinguistics. Here, we use eyetracking data to examine three language processing regimes that are common in daily life but have not been addressed with respect to this question: information seeking, repeated processing, and the combination of the two. Using standard regime-agnostic surprisal estimates we find that the prediction of surprisal theory regarding the presence of a linear effect of surprisal on processing times, extends to these regimes. However, when using surprisal estimates from regime-specific contexts that match the contexts and tasks given to humans, we find that in information seeking, such estimates do not improve the predictive power of processing times compared to standard surprisals. Further, regime-specific contexts yield near zero surprisal estimates with no predictive power for processing times in repeated reading. These findings point to misalignments of task and memory representations between humans and current language models, and question the extent to which such models can be used for estimating cognitively relevant quantities. We further discuss theoretical challenges posed by these results.

pdf bib abs

Revisiting Hierarchical Text Classification: Inference and Metrics
Roman Plaud | Matthieu Labeau | Antoine Saillenfest | Thomas Bonald

Hierarchical text classification (HTC) is the task of assigning labels to a text within a structured space organized as a hierarchy. Recent works treat HTC as a conventional multilabel classification problem, therefore evaluating it as such. We instead propose to evaluate models based on specifically designed hierarchical metrics and we demonstrate the intricacy of metric choice and prediction inference method. We introduce a new challenging dataset and we evaluate fairly, recent sophisticated models, comparing them with a range of simple but strong baselines, including a new theoretically motivated loss. Finally, we show that those baselines are very often competitive with the latest models. This highlights the importance of carefully considering the evaluation methodology when proposing new methods for HTC

pdf bib abs

NeLLCom-X: A Comprehensive Neural-Agent Framework to Simulate Language Learning and Group Communication
Yuchen Lian | Tessa Verhoef | Arianna Bisazza

Recent advances in computational linguistics include simulating the emergence of human-like languages with interacting neural network agents, starting from sets of random symbols. The recently introduced NeLLCom framework (Lian et al., 2023) allows agents to first learn an artificial language and then use it to communicate, with the aim of studying the emergence of specific linguistics properties. We extend this framework (NeLLCom-X) by introducing more realistic role-alternating agents and group communication in order to investigate the interplay between language learnability, communication pressures, and group size effects. We validate NeLLCom-X by replicating key findings from prior research simulating the emergence of a word-order/case-marking trade-off. Next, we investigate how interaction affects linguistic convergence and emergence of the trade-off. The novel framework facilitates future simulations of diverse linguistic aspects, emphasizing the importance of interaction and group dynamics in language evolution.

pdf bib abs

A Novel Instruction Tuning Method for Vietnamese Mathematical Reasoning using Trainable Open-Source Large Language Models
Quang-Vinh Nguyen | Thanh-Do Nguyen | Van-Vinh Nguyen | Khac-Hoai Nam Bui

This study introduces Simple Reasoning with Code (SiRC), a novel instruction fine-tuning method for solving mathematical reasoning problems, particularly effective for Vietnamese, which is considered a low-resource language. Specifically, solving mathematical problems requires strategic and logical reasoning, which remains challenging in this research area. This paper presents a simple yet effective instruction fine-tuning method for mathematical reasoning. Unlike previous approaches, our proposed method effectively combines chain-of-thought reasoning with code transfer methods without requiring a sophisticated inference procedure. Furthermore, we focus on exploiting small open-source large language models (LLMs) for the Vietnamese language. In this regard, we first introduce a trainable Vietnamese math reasoning dataset, which is named ViMath-InstructCode. The proposed dataset is then used for fine-tuning open-source LLMs (e.g., less than 10 billion parameters). Experiments conducted on our custom ViMath-Bench dataset, the largest benchmarking dataset focusing on Vietnamese mathematical problems, indicate the promising results of our proposed method. Our source code and dataset are available for further exploitation.

pdf bib abs

Generalizations across filler-gap dependencies in neural language models
Katherine Howitt | Sathvik Nair | Allison Dods | Robert Melvin Hopkins

Humans develop their grammars by making structural generalizations from finite input. We ask how filler-gap dependencies (FGDs), which share a structural generalization despite diverse surface forms, might arise from the input. We explicitly control the input to a neural language model (NLM) to uncover whether the model posits a shared representation for FGDs. We show that while NLMs do have success differentiating grammatical from ungrammatical FGDs, they rely on superficial properties of the input, rather than on a shared generalization. Our work highlights the need for specific linguistic inductive biases to model language acquisition.

pdf bib abs

Interpretability studies have played an important role in the field of NLP. They focus on the problems of how models encode information or, for instance, whether linguistic capabilities allow them to prefer grammatical sentences to ungrammatical. Recently, several studies examined whether the models demonstrate patterns similar to humans and whether they are sensitive to the phenomena of interference like humans’ grammaticality judgements, including the phenomenon of agreement attraction.In this paper, we probe BERT and GPT models on the syntactic phenomenon of agreement attraction in Russian using the psycholinguistic data with syncretism. Working on the language with syncretism between some plural and singular forms allows us to differentiate between the effects of the surface form and of the underlying grammatical feature. Thus we can further investigate models’ sensitivity to this phenomenon and examine if the patterns of their behaviour are similar to human patterns. Moreover, we suggest a new way of comparing models’ and humans’ responses via statistical testing. We show that there are some similarities between models’ and humans’ results, while GPT is somewhat more aligned with human responses than BERT. Finally, preliminary results suggest that surface form syncretism influences attraction, perhaps more so than grammatical form syncretism.

pdf bib abs

Is Structure Dependence Shaped for Efficient Communication?: A Case Study on Coordination
Kohei Kajikawa | Yusuke Kubota | Yohei Oseki

Natural language exhibits various universal properties.But why do these universals exist?One explanation is that they arise from functional pressures to achieve efficient communication, a view which attributes cross-linguistic properties to domain-general cognitive abilities.This hypothesis has successfully addressed some syntactic universal properties such as compositionality and Greenbergian word order universals.However, more abstract syntactic universals have not been explored from the perspective of efficient communication.Among such universals, the most notable one is structure dependence, that is, grammar-internal operations crucially depend on hierarchical representations.This property has traditionally been taken to be central to natural language and to involve domain-specific knowledge irreducible to communicative efficiency. In this paper, we challenge the conventional view by investigating whether structure dependence realizes efficient communication, focusing on coordinate structures.We design three types of artificial languages: (i) one with a structure-dependent reduction operation, which is similar to natural language, (ii) one without any reduction operations, and (iii) one with a linear (rather than structure-dependent) reduction operation.We quantify the communicative efficiency of these languages.The results demonstrate that the language with the structure-dependent reduction operation is significantly more communicatively efficient than the counterfactual languages.This suggests that the existence of structure-dependent properties can be explained from the perspective of efficient communication.

pdf bib abs

Large Language Model Recall Uncertainty is Modulated by the Fan Effect
Jesse Roberts | Kyle Moore | Douglas Fisher | Oseremhen Ewaleifoh | Thao Pham

This paper evaluates whether large language models (LLMs) exhibit cognitive fan effects, similar to those discovered by Anderson in humans, after being pre-trained on human textual data. We conduct two sets of in-context recall experiments designed to elicit fan effects. Consistent with human results, we find that LLM recall uncertainty, measured via token probability, is influenced by the fan effect. Our results show that removing uncertainty disrupts the observed effect. The experiments suggest the fan effect is consistent whether the fan value is induced in-context or in the pre-training data. Finally, these findings provide in-silico evidence that fan effects and typicality are expressions of the same phenomena.

pdf bib abs

Continuous Attentive Multimodal Prompt Tuning for Few-Shot Multimodal Sarcasm Detection
Soumyadeep Jana | Animesh Dey | Ranbir Singh Sanasam

With the steep rise in multimodal content on social media, multimodal sarcasm detection has gained widespread attention from research communities. Existing studies depend on large-scale data, which is challenging to obtain and expensive to annotate. Thus, investigating this problem in a few-shot scenario is required. Overtly complex multimodal models are prone to overfitting on in-domain data, which hampers their performance on out-of-distribution (OOD) data. To address these issues, we propose Continuous Attentive Multimodal Prompt Tuning model (CAMP), that leverages the prompt tuning paradigm to handle few-shot multimodal sarcasm detection. To overcome the siloed learning process of continuous prompt tokens, we design a novel, continuous multimodal attentive prompt where the continuous tokens intricately engage with both image and text tokens, enabling the assimilation of knowledge from different input modalities. Experimental results indicate that our method outperforms other multimodal baseline methods in the few-shot setting and OOD scenarios.

pdf bib abs

Aligning Alignments: Do Colexification and Distributional Similarity Align as Measures of cross-lingual Lexical Alignment?
Taelin Karidi | Eitan Grossman | Omri Abend

The data-driven investigation of the extent to which lexicons of different languages align has mostly fallen into one of two categories:colexification-based and distributional. The two approaches are grounded in distinct methodologies, operate on different assumptions, and are used in diverse ways.This raises two important questions: (a) are there settings in which the predictions of the two approaches can be directly compared? and if so, (b) what is the extent of the similarity and what are its determinants? We offer novel operationalizations for the two approaches in a manner that allows for their direct comparison, and conduct a comprehensive analysis on a diverse set of 16 languages.Our analysis is carried out at different levels of granularity. At the word-level, the two methods present different results across the board. However, intriguingly, at the level of semantic domains (e.g., kinship, quantity), the two methods show considerable convergence in their predictions.A detailed comparison of the metrics against a carefully validated dataset of kinship terms shows that the distributional methods likely capture a more fine-grained alignment than their counterpart colexification-based methods, and may thus be more suited for settings where fewer languages are evaluated.

pdf bib abs

Text2Afford: Probing Object Affordance Prediction abilities of Language Models solely from Text
Sayantan Adak | Daivik Agrawal | Animesh Mukherjee | Somak Aditya

We investigate the knowledge of object affordances in pre-trained language models (LMs) and pre-trained Vision-Language models (VLMs).A growing body of literature shows that PTLMs fail inconsistently and non-intuitively, demonstrating a lack of reasoning and grounding. To take a first step toward quantifying the effect of grounding (or lack thereof), we curate a novel and comprehensive dataset of object affordances – Text2Afford, characterized by 15 affordance classes. Unlike affordance datasets collected in vision and language domains, we annotate in-the-wild sentences with objects and affordances. Experimental results reveal that PTLMs exhibit limited reasoning abilities when it comes to uncommon object affordances. We also observe that pre-trained VLMs do not necessarily capture object affordances effectively. Through few-shot fine-tuning, we demonstrate improvement in affordance knowledge in PTLMs and VLMs. Our research contributes a novel dataset for language grounding tasks, and presents insights into LM capabilities, advancing the understanding of object affordances.

pdf bib abs

The ability to compare by analogy, metaphorically or not, lies at the core of how humans understand the world and communicate. In this paper, we study the likelihood of metaphoric outputs, and the capability of a wide range of pretrained transformer-based language models to identify metaphors from other types of analogies, including anomalous ones. In particular, we are interested in discovering whether language models recognise metaphorical analogies equally well as other types of analogies, and whether the model size has an impact on this ability. The results show that there are relevant differences using perplexity as a proxy, with the larger models reducing the gap when it comes to analogical processing, and for distinguishing metaphors from incorrect analogies. This behaviour does not result in increased difficulties for larger generative models in identifying metaphors in comparison to other types of analogies from anomalous sentences in a zero-shot generation setting, when perplexity values of metaphoric and non-metaphoric analogies are similar.

pdf bib abs

Further Compressing Distilled Language Models via Frequency-aware Partial Sparse Coding of Embeddings
Kohki Tamura | Naoki Yoshinaga | Masato Neishi

Although pre-trained language models (PLMs) are effective for natural language understanding (NLU) tasks, they demand a huge computational resource, thus preventing us from deploying them on edge devices. Researchers have therefore applied compression techniques for neural networks, such as pruning, quantization, and knowledge distillation, to the PLMs. Although these generic techniques can reduce the number of internal parameters of hidden layers in the PLMs, the embedding layers tied to the tokenizer arehard to compress, occupying a non-negligible portion of the compressed model. In this study, aiming to further compress PLMs reduced by the generic techniques, we exploit frequency-aware sparse coding to compress the embedding layers of the PLMs fine-tuned to downstream tasks. To minimize the impact of the compression on the accuracy, we retain the embeddings of common tokens as they are and use them to reconstruct embeddings of rare tokens by locally linear mapping. Experimental results on the GLUE and JGLUE benchmarks for language understanding in English and Japanese confirmed that our method can further compress the fine-tuned DistilBERT models models while maintaining accuracy.

pdf bib abs

Translating Across Cultures: LLMs for Intralingual Cultural Adaptation
Pushpdeep Singh | Mayur Patidar | Lovekesh Vig

LLMs are increasingly being deployed for multilingual applications and have demonstrated impressive translation capabilities between several low and high-resource languages. An aspect of translation that often gets overlooked is that of cultural adaptation, or modifying source culture references to suit the target culture. While specialized translation models still outperform LLMs on the machine translation task when viewed from the lens of correctness, they are not sensitive to cultural differences often requiring manual correction. LLMs on the other hand have a rich reservoir of cultural knowledge embedded within its parameters that can be potentially exploited for such applications. In this paper, we define the task of cultural adaptation and create an evaluation framework to evaluate the performance of modern LLMs for cultural adaptation and analyze their cross-cultural knowledge while connecting related concepts across different cultures. We also analyze possible issues with automatic adaptation. We hope that this task will offer more insight into the cultural understanding of LLMs and their creativity in cross-cultural scenarios.

pdf bib abs

We seek to explain the causes of the misclassification of the most challenging documents, namely those that no classifier using state-of-the-art, very semantically-separable contextual embedding representations managed to predict accurately. To do so, we propose a taxonomy of incorrect predictions, which we used to perform qualitative human evaluation. We posed two (research) questions, considering three sentiment datasets in two different domains – movie and product reviews. Evaluators with two different backgrounds evaluated documents by comparing the predominant sentiment assigned by the model to the label in the gold dataset in order to decide on a likely misclassification reason. Based on a high inter-evaluator agreement (81.7%), we observed significant differences between the product and movie review domains, such as the prevalence of ambivalence in product reviews and sarcasm in movie reviews. Our analysis also revealed an unexpectedly high rate of incorrect labeling in the gold dataset (up to 33%) and a significant amount of incorrect prediction by the model due to a series of linguistic phenomena (including amplified words, contrastive markers, comparative sentences, and references to world knowledge). Overall, our taxonomy and methodology allow us to explain between 80%-85% of the errors with high confidence (agreement) – enabling us to point out where future efforts to improve models should be concentrated.

pdf bib abs

A Multimodal Large Language Model “Foresees” Objects Based on Verb Information but Not Gender
Shuqi Wang | Xufeng Duan | Zhenguang Cai

This study employs the classical psycholinguistics paradigm, the visual world eye-tracking paradigm (VWP), to explore the predictive capabilities of LLAVA, a multimodal large language model (MLLM), and compare them with human anticipatory gaze behaviors. Specifically, we examine the attention weight distributions of LLAVA when presented with visual displays and English sentences containing verb and gender cues. Our findings reveal that LLAVA, like humans, can predictively attend to objects relevant to verbs, but fails to demonstrate gender-based anticipatory attention. Layer-wise analysis indicates that the middle layers of the model are more related to predictive attention than the early or late layers. This study is pioneering in applying psycholinguistic paradigms to compare the multimodal predictive attention of humans and MLLMs, revealing both similarities and differences between them.

We introduce the Principled Reasoning and Acting (PRAct) framework, a novel method for learning and enforcing action principles from trajectory data. Central to our approach is the use of text gradients from a reflection and optimization engine to derive these action principles. To adapt action principles to specific task requirements, we propose a new optimization framework, Reflective Principle Optimization (RPO). After execution, RPO employs a reflector to critique current action principles and an optimizer to update them accordingly.We investigate the RPO framework under two scenarios: Reward-RPO, which uses environmental rewards for reflection, and Self-RPO, which conducts self-reflection without external rewards. Additionally, we developed two RPO methods, RPO-Traj and RPO-Batch, to adapt to different settings.Experimental results across four environments demonstrate that the PRAct agent, leveraging the RPO framework, can effectively learn and apply action principles to enhance performance.

pdf bib abs

Image-conditioned human language comprehension and psychometric benchmarking of visual language models
Subha Nawer Pushpita | Roger P. Levy

Large language model (LLM)s’ next-word predictions have shown impressive performance in capturing human expectations during real-time language comprehension. This finding has enabled a line of research on psychometric benchmarking of LLMs against human language-comprehension data in order to reverse-engineer humans’ linguistic subjective probability distributions and representations. However, to date, this work has exclusively involved unimodal (language-only) comprehension data, whereas much human language use takes place in rich multimodal contexts. Here we extend psychometric benchmarking to visual language models (VLMs). We develop a novel experimental paradigm, Image-Conditioned Maze Reading, in which participants first view an image and then read a text describing an image within the Maze paradigm, yielding word-by-word reaction-time measures with high signal-to-noise ratio and good localization of expectation-driven language processing effects. We find a large facilitatory effect of correct image context on language comprehension, not only for words such as concrete nouns that are directly grounded in the image but even for ungrounded words in the image descriptions. Furthermore, we find that VLM surprisal captures most to all of this effect. We use these findings to benchmark a range of VLMs, showing that models with lower perplexity generally have better psychometric performance, but that among the best VLMs tested perplexity and psychometric performance dissociate. Overall, our work offers new possibilities for connecting psycholinguistics with multimodal LLMs for both scientific and engineering goals.

pdf bib abs

Self-supervised speech representations display some human-like cross-linguistic perceptual abilities
Joselyn Rodriguez | Kamala Sreepada | Ruolan Leslie Famularo | Sharon Goldwater | Naomi Feldman

State of the art models in automatic speech recognition have shown remarkable improvements due to modern self-supervised (SSL) transformer-based architectures such as wav2vec 2.0 (Baevski et al., 2020). However, how these models encode phonetic information is still not well understood. We explore whether SSL speech models display a linguistic property that characterizes human speech perception: language specificity. We show that while wav2vec 2.0 displays an overall language specificity effect when tested on Hindi vs. English, it does not resemble human speech perception when tested on finer-grained differences in Hindi speech contrasts.

pdf bib abs

One-Vs-Rest Neural Network English Grapheme Segmentation: A Linguistic Perspective
Samuel Rose | Nina Dethlefs | C. Kambhampati

Grapheme-to-Phoneme (G2P) correspondences form foundational frameworks of tasks such as text-to-speech (TTS) synthesis or automatic speech recognition. The G2P process involves taking words in their written form and generating their pronunciation. In this paper, we critique the status quo definition of a grapheme, currently a forced alignment process relating a single character to either a phoneme or a blank unit, that underlies the majority of modern approaches. We develop a linguistically-motivated redefinition from simple concepts such as vowel and consonant count and word length and offer a proof-of-concept implementation based on a multi-binary neural classification task. Our model achieves state-of-the-art results with a 31.86% Word Error Rate on a standard benchmark, while generating linguistically meaningful grapheme segmentations.

pdf bib abs

CrowdCounter: A benchmark type-specific multi-target counterspeech dataset
Punyajoy Saha | Abhilash Datta | Abhik Jana | Animesh Mukherjee

Counterspeech presents a viable alternative to banning or suspending users for hate speech while upholding freedom of expression. However, writing effective counterspeech is challenging for moderators/users. Hence, developing suggestion tools for writing counterspeech is the need of the hour. One critical challenge in developing such a tool is the lack of quality and diversity of the responses in the existing datasets. Hence, we introduce a new dataset - CrowdCounter containing 3,425 hate speech-counterspeech pairs spanning six different counterspeech types (empathy, humor, questioning, warning, shaming, contradiction), which is the first of its kind. The design of our annotation platform itself encourages annotators to write type-specific, non-redundant and high-quality counterspeech. We evaluate two frameworks for generating counterspeech responses - vanilla and type-controlled prompts - across four large language models. In terms of metrics, we evaluate the responses using relevance, diversity and quality. We observe that Flan-T5 is the best model in the vanilla framework across different models. Type-specific prompts enhance the relevance of the responses, although they might reduce the language quality. DialoGPT proves to be the best at following the instructions and generating the type-specific counterspeech accurately.

pdf bib abs

Solving the Challenge Set without Solving the Task: On Winograd Schemas as a Test of Pronominal Coreference Resolution
Ian Porada | Jackie CK Cheung

Challenge sets such as the Winograd Schema Challenge (WSC) are used to benchmark systems’ ability to resolve ambiguities in natural language. If one assumes as in existing work that solving a given challenge set is at least as difficult as solving some more general task, then high performance on the challenge set should indicate high performance on the general task overall. However, we show empirically that this assumption of difficulty does not always hold. In particular, we demonstrate that despite the strong performance of prompted language models (LMs) on the WSC and its variants, these same modeling techniques perform relatively poorly at resolving certain pronominal ambiguities attested in OntoNotes and related datasets that are perceived to be easier. Motivated by these findings, we propose a method for ensembling a prompted LM with a supervised, task-specific system that is overall more accurate at resolving pronominal coreference across datasets. Finally, we emphasize that datasets involving the same linguistic phenomenon draw on distinct, but overlapping, capabilities, and evaluating on any one dataset alone does not provide a complete picture of a system’s overall capability.

pdf bib abs

Advancing Arabic Sentiment Analysis: ArSen Benchmark and the Improved Fuzzy Deep Hybrid Network
Yang Fang | Cheng Xu | Shuhao Guan | Nan Yan | Yuke Mei

Sentiment analysis is pivotal in Natural Language Processing for understanding opinions and emotions in text. While advancements in Sentiment analysis for English are notable, Arabic Sentiment Analysis (ASA) lags, despite the growing Arabic online user base. Existing ASA benchmarks are often outdated and lack comprehensive evaluation capabilities for state-of-the-art models. To bridge this gap, we introduce ArSen, a meticulously annotated COVID-19-themed Arabic dataset, and the IFDHN, a novel model incorporating fuzzy logic for enhanced sentiment classification. ArSen provides a contemporary, robust benchmark, and IFDHN achieves state-of-the-art performance on ASA tasks. Comprehensive evaluations demonstrate the efficacy of IFDHN using the ArSen dataset, highlighting future research directions in ASA.

pdf bib abs

Leveraging a Cognitive Model to Measure Subjective Similarity of Human and GPT-4 Written Content
Tyler Malloy | Maria José Ferreira | Fei Fang | Cleotilde Gonzalez

Cosine similarity between two documents can be computed using token embeddings formed by Large Language Models (LLMs) such as GPT-4, and used to categorize those documents across a range of uses. However, these similarities are ultimately dependent on the corpora used to train these LLMs, and may not reflect subjective similarity of individuals or how their biases and constraints impact similarity metrics. This lack of cognitively-aware personalization of similarity metrics can be particularly problematic in educational and recommendation settings where there is a limited number of individual judgements of category or preference, and biases can be particularly relevant. To address this, we rely on an integration of an Instance-Based Learning (IBL) cognitive model with LLM embeddings to develop the Instance-Based Individualized Similarity (IBIS) metric. This similarity metric is beneficial in that it takes into account individual biases and constraints in a manner that is grounded in the cognitive mechanisms of decision making. To evaluate the IBIS metric, we also introduce a dataset of human categorizations of emails as being either dangerous (phishing) or safe (ham). This dataset is used to demonstrate the benefits of leveraging a cognitive model to measure the subjective similarity of human participants in an educational setting.

pdf (full)
bib (full) The 2nd BabyLM Challenge at the 28th Conference on Computational Natural Language Learning

pdf bib

pdf bib abs

The BabyLM Challenge is a community effort to close the data-efficiency gap between human and computational language learners. Participants compete to optimize language model training on a fixed language data budget of 100 million words or less. This year, we released improved text corpora, as well as a vision-and-language corpus to facilitate research into cognitively plausible vision language models. Submissions were compared on evaluation tasks targeting grammatical ability, (visual) question answering, pragmatic abilities, and grounding, among other abilities. Participants could submit to a 10M-word text-only track, a 100M-word text-only track, and/or a 100M-word and image multimodal track. From 31 submissions employing diverse methods, a hybrid causal-masked language model architecture outperformed other approaches. No submissions outperformed the baselines in the multimodal track. In follow-up analyses, we found a strong relationship between training FLOPs and average performance across tasks, and that the best-performing submissions proposed changes to the training data, training objective, and model architecture. This year’s BabyLM Challenge shows that there is still significant room for innovation in this setting, in particular for image-text modeling, but community-driven research can yield actionable insights about effective strategies for small-scale language modeling.

pdf bib abs

Towards Data-Efficient Language Models: A Child-Inspired Approach to Language Learning
Mohammad Amin Ghanizadeh | Mohammad Javad Dousti

In this work, we explain our approach employed in the BabyLM Challenge, which uses various methods of training language models (LMs) with significantly less data compared to traditional large language models (LLMs) and are inspired by how human children learn. While a human child is exposed to far less linguistic input than an LLM, they still achieve remarkable language understanding and generation abilities. To this end, we develop a model trained on a curated dataset consisting of 10 million words, primarily sourced from child-directed transcripts. The 2024 BabyLM Challenge initial dataset of 10M words is filtered to 8.5M. Next, it is supplemented with a randomly selected subset of TVR dataset consisting of 1.5M words of television dialogues. The latter dataset ensures that similar to children, the model is also exposed to language through media. Furthermore, we reduce the vocabulary size to 32,000 tokens, aligning it with the limited vocabulary of children in the early stages of language acquisition. We use curriculum learning and is able to match the baseline on certain benchmarks while surpassing the baseline on others. Additionally, incorporating common LLM training datasets, such as MADLAD-400, degrades performance. These findings underscore the importance of dataset selection, vocabulary scaling, and curriculum learning in creating more data-efficient language models that better mimic human learning processes.

pdf bib abs

BabyLM Challenge: Experimenting with Self-Distillation and Reverse-Distillation for Language Model Pre-Training on Constrained Datasets
Aakarsh Nair | Alina Hancharova | Mayank Kumar | Ali Gharaee

Language models (LMs) exhibit significant data inefficiency compared to human learners. A child is able to master language while consuming less than 100 million words of input, while language models require orders of magnitude more tokens during training. Our submission to the BabyLM Challenge utilizes a combination of self-distillation and reverse-distillation to train a sequence of ensemble models with improved training characteristics on a fixed-size 10 million-word dataset. Self-distillation is used to generate an ensemble of models of a certain fixed size, while reverse distillation is used to train a more expressive larger model from a previously trained generation of relatively smaller models, while largely preserving learned accuracy.We find that ensembles consisting of two smaller models and one identical born-again model serve as ideal ensembles for each trained generation of model size. We demonstrate that, although our method is not novel, it provides consistent and modest performance improvements on the BLiMP and GLUE benchmarks.

pdf bib abs

From Babble to Words: Pre-Training Language Models on Continuous Streams of Phonemes
Zébulon Goriely | Richard Diehl Martinez | Andrew Caines | Paula Buttery | Lisa Beinborn

Language models are typically trained on large corpora of text in their default orthographic form. However, this is not the only option; representing data as streams of phonemes can offer unique advantages, from deeper insights into phonological language acquisition to improved performance on sound-based tasks. The challenge lies in evaluating the impact of phoneme-based training, as most benchmarks are also orthographic. To address this, we develop a pipeline to convert text datasets into a continuous stream of phonemes. We apply this pipeline to the 100-million-word pre-training dataset from the BabyLM challenge, as well as to standard language and grammatical benchmarks, enabling us to pre-train and evaluate a model using phonemic input representations. Our results show that while phoneme-based training slightly reduces performance on traditional language understanding tasks, it offers valuable analytical and practical benefits.

pdf bib abs

Graphemes vs. phonemes: battling it out in character-based language models
Bastian Bunzeck | Daniel Duran | Leonie Schade | Sina Zarrieß

We present grapheme-llama and phoneme-llama, character-based language models trained for the 2024 BabyLM challenge. Through these models, we explore an under-researched approach to downsizing: replacing subword-based tokenization with character-level tokenization, drastically reducing the vocabulary size. The grapheme model is trained on a standard BabyLM dataset, while the phoneme model uses a phoneme-converted version of this dataset. Results show that grapheme-based models perform better overall, achieving scores comparable to subword-based models on grammatical benchmarks. Despite lower performance, phoneme models also demonstrate promising grammatical learning. We argue that our results challenge conventional wisdom on language modeling techniques and open up novel research questions with character- and phoneme-based models as objects of inquiry.

pdf bib abs

Exploring Curriculum Learning for Vision-Language Tasks: A Study on Small-Scale Multimodal Training
Rohan Saha | Abrar Fahim | Alona Fyshe | Alex Murphy

For specialized domains, there is often not a wealth of data with which to train large machine learning models. In such limited data / compute settings, various methods exist aiming to do more with less, such as finetuning from a pretrained model, modulating difficulty levels as data are presented to a model (curriculum learning), and considering the role of model type / size. Approaches to efficient machine learning also take inspiration from human learning by considering use cases where machine learning systems have access to approximately the same number of words experienced by a 13 year old child (100M words). We investigate the role of 3 primary variables in a limited data regime as part of the multimodal track of the BabyLM challenge. We contrast: (i) curriculum learning, (ii), pretraining (with text-only data), (iii) model type. We modulate these variables and assess them on two types of tasks: (a) multimodal (text+image), and (b) unimodal (text-only) tasks. We find that curriculum learning benefits multimodal evaluations over non-curriclum learning models, particularly when combining text-only pretraining. On text-only tasks, curriculum learning appears to help models with smaller trainable parameter counts. We suggest possible reasons based on architectural differences and training designs as to why one might observe such results.

pdf bib abs

BabyHGRN: Exploring RNNs for Sample-Efficient Language Modeling
Patrick Haller | Jonas Golde | Alan Akbik

This paper explores the potential of recurrent neural networks (RNNs) and other subquadratic architectures as competitive alternatives to transformer-based models in low-resource language modeling scenarios. We utilize HGRN2 (Qin et al., 2024), a recently proposed RNN-based architecture, and comparatively evaluate its effectiveness against transformer-based baselines and other subquadratic architectures (LSTM, xLSTM, Mamba). Our experimental results show that, our HGRN2 language model, outperforms transformer-based models in both the 10M and 100M word tracks of the challenge, as measured by their performance on the BLiMP, EWoK, GLUE and BEAR benchmarks. Further, we show the positive impact of knowledge distillation. Our findings challenge the prevailing focus on transformer architectures and indicate the viability of RNN-based models, particularly in resource-constrained environments.

pdf bib abs

Choosy Babies Need One Coach: Inducing Mode-Seeking Behavior in BabyLlama with Reverse KL Divergence
Shaozhen Shi | Yevgen Matusevych | Malvina Nissim

This study presents our submission to the Strict-Small Track of the 2nd BabyLM Challenge. We use a teacher-student distillation setup with the BabyLLaMa model (Timiryasov and Tastet, 2023) as a backbone. To make the student’s learning process more focused, we replace the objective function with a reverse Kullback-Leibler divergence, known to cause mode-seeking (rather than mode-averaging) behaviour in computational learners. We further experiment with having a single teacher (instead of an ensemble of two teachers) and implement additional optimization strategies to improve the distillation process. Our experiments show that under reverse KL divergence, a single-teacher model often outperforms or matches multiple-teacher models across most tasks. Additionally, incorporating advanced optimization techniques further enhances model performance, demonstrating the effectiveness and robustness of our proposed approach. These findings support our idea that “choosy babies need one coach”.

pdf bib abs

This work explores alternative gating systems in simple Recurrent Neural Networks (RNNs) that induce linguistically motivated biases during training, ultimately affecting models’ performance on the BLiMP task. We focus exclusively on the BabyLM 10M training corpus (Strict-Small Track). Our experiments reveal that: (i) standard RNN variants—LSTMs and GRUs—are insufficient for properly learning the relevant set of linguistic constraints; (ii) the quality or size of the training corpus has little impact on these networks, as demonstrated by the comparable performance of LSTMs trained exclusively on the child-directed speech portion of the corpus; (iii) increasing the size of the embedding and hidden layers does not significantly improve performance. In contrast, specifically gated RNNs (eMG-RNNs), inspired by certain Minimalist Grammar intuitions, exhibit advantages in both training loss and BLiMP accuracy.

pdf bib abs

Developmentally Plausible Multimodal Language Models Are Highly Modular
Alina Klerings | Christian Bartelt | Aaron Mueller

Large language models demonstrate emergent modularity, where functionally specialized components and circuits arise to handle specific tasks or task formats. If similar modules arise in models trained on more cognitively plausible datasets, it could inform debates surrounding what kinds of would be learnable given more human-like language learning signals. In this paper, we describe a multimodal vision-language model submitted to the BabyLM Challenge. Our model achieves similar performance to the best-performing architectures from last year, though visual information does not improve performance on text-only tasks over text-only models (in accordance with prior findings). To better understand how the model processes the evaluation tasks of the BabyLM Challenge, we leverage causal interpretability methods to locate the neurons that contribute to the model’s final decisions. We find that the models we train are highly modular: distinct components arise to process related tasks. Furthermore, on text-and-image tasks, adding or removing visual inputs causes the model to use distinct components to process the same textual inputs. This suggests that modal and task-specific specialization is efficiently learned, and that a high degree of functional specialization arises in even small-scale language models.

pdf bib abs

ELC-ParserBERT: Low-Resource Language Modeling Utilizing a Parser Network With ELC-BERT
Rufus Behr

This paper investigates the effect of including a parser network, which produces syntactic heights and distances to perform unsupervised parsing, in the Every Layer Counts BERT (ELC-BERT) architecture trained on 10M tokens for the 2024 BabyLM challenge. The parser network’s inclusion in this setup shows little or no improvement over the ELC-BERT baseline for the BLiMP and GLUE evaluation, but, in particular domains of the EWoK evaluation framework, its inclusion shows promise for improvement and raises interesting questions about its effect on learning different concepts.

pdf bib abs

Extending the BabyLM Initiative : Promoting Diversity in Datasets and Metrics through High-Quality Linguistic Corpora
Laurent Prévot | Sheng-Fu Wang | Jou-An Chi | Shu-Kai Hsieh

BabyLM paves the way for a range of experiments aimed at better understanding language models (LMs) and the differences and similarities between human and artificial language learning. However, the current framework is limited to the English language and a narrow but significant range of evaluation metrics, primarily focused on syntax, semantics, and pragmatics. In this paper, we propose some steps towards extending the framework to other languages, specifically Mandarin Chinese and French, leveraging existing linguistic resources for these languages. Additionally, we advocate for greater exploration of genre variations within subcorpora for training LMs, as well as for the adoption of additional evaluation metrics with different underlying principles. Our proposal consists of using high-quality spontaneous speech corpora as a source for extracting production-related variables, which the models are then fine-tuned to predict. We hypothesize that these production-related features offer insights into the language processing mechanisms underlying the data and that cognitively sensitive models should outperform others in predicting these features. Specifically, we propose focusing on the prediction of phenomena such as speech reductions, prosodic prominences, sequences co-occurring with listeners’ backchannels, and disfluencies. To illustrate our approach, we present an example involving the prediction of speech reductions in spontaneous speech in two different languages (French and English), using models trained on 10 million tokens from different data source mixtures. Although the results are preliminary, they suggest that this task can characterize models for predicting human language processing.

pdf bib abs

Integrating Quasi-symbolic Conceptual Knowledge into Language Model Pre-training
Gábor Berend

In this paper, we investigate the integration of latent conceptual knowledge into the pre-training of masked language models. Our solution is based on the use of an auxiliary model, from which we extract training signals for training a student model. We determine the training signals from the hidden representations of the student model in an unsupervised way, using sparse coding. Models trained on latent concepts alone have an improved fine-tunability on downstream tasks, however, they perform worse on traditional language modeling, i.e., when the goal is to output missing tokens as opposed to latent semantic classes of words. In order to preserve the improved fine-tuning capability of the models, while making them better at the task of language modeling, we propose a final stage of pre-training, during which we perform traditional masked language modeling. The final stage of pre-training is based on a model that has already been pre-trained on the task of modeling latent semantic properties, with the weights of the backbone model being frozen. During the final training phase, we only train a lightweight linear classifier layer on top of the logits that the model determines for the latent semantic properties. With this modification, we can obtain the benefits of both the traditional training paradigms and the one which is based on the use of latent semantic properties. We release our source code at github.com/SzegedAI/MLSM.

pdf bib abs

Are BabyLMs Second Language Learners?
Lukas Edman | Lisa Bylinina | Faeze Ghorbanpour | Alexander Fraser

This paper describes a linguistically-motivated approach to the 2024 edition of the BabyLM Challenge. Rather than pursuing a first language learning (L1) paradigm, we approach the challenge from a second language (L2) learning perspective. In L2 learning, there is a stronger focus on learning explicit linguistic information, such as grammatical notions, definitions of words or different ways of expressing a meaning. This makes L2 learning potentially more efficient and concise. We approximate this using data from Wiktionary, grammar examples either generated by an LLM or sourced from grammar books, and paraphrase data.We find that explicit information about word meaning (in our case, Wiktionary) does not boost model performance, while grammatical information can give a small improvement. The most impactful data ingredient is sentence paraphrases, with our two best models being trained on 1) a mix of paraphrase data and data from the BabyLM pretraining dataset, and 2) exclusively paraphrase data.

pdf bib abs

Less is More: Pre-Training Cross-Lingual Small-Scale Language Models with Cognitively-Plausible Curriculum Learning Strategies
Suchir Salhan | Richard Diehl Martinez | Zébulon Goriely | Paula Buttery

Curriculum Learning has been a popular strategy to improve the cognitive plausibility of Small-Scale Language Models (SSLMs) in the BabyLM Challenge. However, it has not led to considerable improvements over non-curriculum models. We assess whether theoretical linguistic acquisition theories can be used to specify more fine-grained curriculum learning strategies, creating age-ordered corpora of Child-Directed Speech for four typologically distant language families to implement SSLMs and acquisition-inspired curricula cross-lingually. Comparing the success of three objective curricula (Growing, Inwards & MMM) that precisely replicate the predictions of acquisition theories on a standard SSLM architecture, we find fine-grained acquisition-inspired curricula can outperform non-curriculum baselines and performance benefits of curricula strategies in SSLMs can be derived by specifying fine-grained language-specific curricula that precisely replicate language acquisition theories.

pdf bib abs

ConcreteGPT: A Baby GPT-2 Based on Lexical Concreteness and Curriculum Learning
Luca Capone | Alessandro Bondielli | Alessandro Lenci

We present a model for the Strict-Small track of the BabyLM Challenge 2024 (Choshen et al. 2024). We introduce a Curriculum Learning approach for training a specialized version of GPT-2 (Radford et al. 2019), that we name ConcreteGPT. We utilize the norms from (Brysbaert et al. 2014) which provide concreteness ratings for 40,000 English lexical items based on human subjects. Using these norms, we assign a concreteness score to each sentence in the training dataset and develop two curriculum strategies that progressively introduce more complex and abstract language patterns in the training data. Compared to the baselines, our best model shows lower performance on zero-shot tasks but demonstrates superior performance in fine-tuning tasks. Notably, our curriculum-trained models exhibit significant improvements over a non-curriculum based training of the same model.

pdf bib abs

When Babies Teach Babies: Can student knowledge sharing outperform Teacher-Guided Distillation on small datasets?
Srikrishna Iyer

We present our submission to the BabyLM challenge, aiming to push the boundaries of data-efficient language model pretraining. Our method builds upon deep mutual learning, introducing a student model search for diverse initialization. We address the limitation of treating students equally by formulating weighted mutual learning as a bi-level optimization problem. The inner loop learns compact students through online distillation, while the outer loop optimizes weights for better knowledge distillation from diverse students. This dynamic weighting strategy eliminates the need for a teacher model, reducing computational requirements. Our evaluations show that teacher-less methods can match or surpass teacher-supervised approaches.

pdf bib abs

Automatic Quality Estimation for Data Selection and Curriculum Learning
Hiep Nguyen | Lynn Yip | Justin DeBenedetto

The size of neural models within natural language processing has increased at a rapid pace in recent years.With this increase in model size comes an increase in the amount of training data required for training.While these larger models have shown strong performance, their use comes with added training and data costs, can be resource-prohibitive for many researchers, and uses an amount of language data that is not always available for all languages.This work focuses on exploring quality estimation as a method of data selection or filtering.The aim is to provide models with higher quality data as compared to larger amounts of data.This approach was applied to machine translation models with varying data sizes as well as to the BabyLM Challenge.Given the 100M word dataset provided in the BabyLM Challenge, we test out various strategies for selecting 10M words for pretraining and use a curriculum learning approach based on the quality estimation scoring.We find small improvements in certain data settings.

pdf bib abs

Using Curriculum Masking Based on Child Language Development to Train a Large Language Model with Limited Training Data
Evan Lucas | Dylan Gaines | Tagore Rao Kosireddy | Kevin Li | Timothy C. Havens

In this paper we detail our submissions to the Strict and Strict-Small tracks of the 2024 BabyLM Challenge. We approach this challenge with two methodologies: i) use of a novel dataset, and ii) development of a pre-training technique based on the fusion of child language acquisition with traditional masked language modeling, which we call curriculum masking. The novel dataset used for this task is based on user submissions to the Reddit forum (i.e., subreddit) “Explain Like I’m Five”, which explains diverse concepts using simple language. Curriculum masking works by creating learning phases based on a standard child language development timeline, where the masked words learned by the model start with simple nouns and gradually expand to include more complex parts of speech. We show that using internet-based training data shows a small improvement in evaluation scores as compared to baseline training data. Our proposed pre-training method of curriculum masking is conceptually novel and also shows improved rates of learning over typical masked language modeling pre-training, potentially allowing for good performance with fewer total epochs on smaller training datasets. Code for the curriculum masking implementation is shared at https://github.com/evan-person/curriculumMaskingBabyLM2024.

pdf bib abs

WhatIf: Leveraging Word Vectors for Small-Scale Data Augmentation
Alex Lyman | Bryce Hepner

We introduce WhatIf, a lightly supervised data augmentation technique that leverages word vectors to enhance training data for small-scale language models. Inspired by reading prediction strategies used in education, WhatIf creates new samples by substituting semantically similar words in the training data. We evaluate WhatIf on multiple datasets, demonstrating small but consistent improvements in downstream evaluation compared to baseline models. Finally, we compare WhatIf to other small-scale data augmentation techniques and find that it provides comparable quantitative results at a potential tradeoff to qualitative evaluation.

pdf bib abs

A surprisal oracle for when every layer counts
Xudong Hong | Sharid Loáiciga | Asad Sayeed

Active Curriculum Language Modeling (ACLM; Hong et al., 2023) is a learner-directed approach to training a language model. We proposed the original version of this process in our submission to the BabyLM 2023 task, and now we propose an updated ACLM process for the BabyLM 2024 task. ACLM involves an iteratively-and dynamically-constructed curriculum informed over the training process by a model of uncertainty; other training items that are similarly uncertain to a least certain candidate item are prioritized. Our new process improves the similarity model so that it is more dynamic, and we run ACLM over the most successful model from the BabyLM 2023 task: ELC-BERT (Charpentier and Samuel, 2023). We find that while our models underperform on fine-grained grammatical inferences, they outperform the BabyLM 2024 official base-lines on common-sense and world-knowledge tasks. We make our code available at https://github.com/asayeed/ActiveBaby.

pdf bib abs

Dreaming Out Loud: A Self-Synthesis Approach For Training Vision-Language Models With Developmentally Plausible Data
Badr AlKhamissi | Yingtian Tang | Abdülkadir Gökce | Johannes Mehrer | Martin Schrimpf

While today’s large language models exhibit impressive abilities in generating human-like text, they require massive amounts of data during training. We here take inspiration from human cognitive development to train models in limited data conditions. Specifically we present a self-synthesis approach that iterates through four phases: Phase 1 sets up fundamental language abilities, training the model from scratch on a small corpus. Language is then associated with the visual environment in phase 2, integrating the model with a vision encoder to generate descriptive captions from labeled images. In the “self-synthesis” phase 3, the model generates captions for unlabeled images, that it then uses to further train its language component with a mix of synthetic, and previous real-world text. This phase is meant to expand the model’s linguistic repertoire, similar to humans self-annotating new experiences. Finally, phase 4 develops advanced cognitive skills, by training the model on specific tasks such as visual question answering and reasoning. Our approach offers a proof of concept for training a multimodal model using a developmentally plausible amount of data.

pdf bib abs

BabyLM Challenge: Exploring the effect of variation sets on language model training efficiency
Akari Haga | Akiyo Fukatsu | Miyu Oba | Arianna Bisazza | Yohei Oseki

While current large language models have achieved a remarkable success, their data efficiency remains a challenge to overcome. Recently it has been suggested that child-directed speech (CDS) can improve training data efficiency of modern language models based on Transformer neural networks. However, it is not yet understood which specific properties of CDS are effective for training these models. In the context of the BabyLM Challenge, we focus on Variation Sets (VSs), sets of consecutive utterances expressing a similar intent with slightly different words and structures, which are ubiquitous in CDS. To assess the impact of VSs on training data efficiency, we augment CDS data with different proportions of artificial VSs and use these datasets to train an auto-regressive model, GPT-2. We find that the best proportion of VSs depends on the evaluation benchmark: BLiMP and GLUE scores benefit from the presence of VSs, but EWOK scores do not. Additionally, the results vary depending on multiple factors such as the number of epochs and the order of utterance presentation. Taken together, these findings suggest that VSs can have a beneficial influence on language models, while leaving room for further investigation.

pdf bib abs

GPT or BERT: why not both?
Lucas Georges Gabriel Charpentier | David Samuel

We present a simple way to merge masked language modeling with causal language modeling. This hybrid training objective results in a model that combines the strengths of both modeling paradigms within a single transformer stack – GPT-BERT can be transparently used like any standard causal or masked language model. We test the pretraining process that enables this flexible behavior on the BabyLM Challenge 2024. The results show that the hybrid pretraining outperforms masked-only or causal-only models. We openly release the models, training corpora and code.

pdf bib abs

What should Baby Models read? Exploring Sample-Efficient Data Composition on Model Performance
Hong Meng Yam | Nathan Paek

We explore the impact of pre-training data composition on the performance of small language models in a sample-efficient setting. Using datasets capped at 10 million words, we evaluate several data sources—including child-directed speech (CHILDES), classic fiction (Gutenberg), a mixed dataset (Mix), and synthetic TinyStories—across different model sizes ranging from 18 million to 705 million parameters. Our experiments show that smaller models (e.g., GPT2-18M and GPT2-44M) benefit from training on diverse datasets like Mix, achieving better performance on linguistic benchmarks. In contrast, larger models (e.g., GPT2-97M, GPT2-705M, and LLaMA-360M) perform better when trained on more complex and rich datasets like Gutenberg. Models trained on the CHILDES and TinyStories datasets underperformed across all model sizes. These findings suggest that the optimal dataset for sample-efficient training depends on the model size, and that neither child-directed speech nor simplified stories are optimal for small language models of all sizes. We highlight the importance of considering both dataset composition and model capacity for effective sample-efficient language model training.

pdf bib abs

BabyLlama-2: Ensemble-Distilled Models Consistently Outperform Teachers With Limited Data
Jean-Loup Tastet | Inar Timiryasov

We present BabyLlama-2, a 345 million parameter model distillation-pretrained from two teachers on a 10 million word corpus for the BabyLM competition. On the BLiMP and SuperGLUE benchmarks, BabyLlama-2 outperforms baselines trained on both 10 and 100 million word datasets with the same data mix, as well as its teacher models. Through an extensive hyperparameter sweep, we demonstrate that the advantages of distillation cannot be attributed to suboptimal hyperparameter selection of the teachers. Our findings underscore the need for further investigation into distillation techniques, particularly in data-limited settings.

pdf bib abs

Teaching Tiny Minds: Exploring Methods to Enhance Knowledge Distillation for Small Language Models
Hong Meng Yam | Nathan Paek

In this paper, we build off of the success of the previous BabyLM challenge winner’s model, BabyLlama, to explore various methods of enhancing knowledge distillation for small language models. Our main focus is on investigating how small a language model can be while still maintaining competitive performance. We experiment with three main approaches: (1) DistilledGPT-44M, which uses smaller teacher models and a more compact student model compared to BabyLlama; (2) ContrastiveLlama-58M, which incorporates contrastive loss into the knowledge distillation process; and (3) MaskedAdversarialLlama-58M, incorporates adversarial loss into the knowledge distillation process. Using the 10M-word dataset from the BabyLM challenge’s strict-small track, we evaluate our models on the BLiMP, EWoK, and GLUE benchmarks. Our results show that effective knowledge distillation can still be achieved with significantly smaller teacher and student models. In particular, our model DistilledGPT-44M is able to achieve better performance than one of last year’s winning entries, LTG-BERT, while achieving similar performance but cutting training time by around 70% and parameters by around 25% compared to the other winning entry, BabyLlama.

pdf bib abs

BERTtime Stories: Investigating the Role of Synthetic Story Data in Language Pre-training
Nikitas Theodoropoulos | Giorgos Filandrianos | Vassilis Lyberatos | Maria Lymperaiou | Giorgos Stamou

We describe our contribution to the Strict and Strict-Small tracks of the 2nd iteration of the BabyLM Challenge. The shared task is centered around efficient pre-training given data constraints motivated by human development. In response, we study the effect of synthetic story data in language pre-training using *TinyStories*: a recently introduced dataset of short stories. Initially, we train GPT-Neo models on subsets of *TinyStories*, while varying the amount of available data. We find that, even with access to less than 100M words, the models are able to generate high-quality, original completions to a given story, and acquire substantial linguistic knowledge. To measure the effect of synthetic story data, we train *LTG-BERT* encoder models on a combined dataset of: a subset of *TinyStories*, story completions generated by GPT-Neo, and a subset of the *BabyLM* dataset. Our experimentation reveals that synthetic data can occasionally offer modest gains, but overall have a negative influence on linguistic understanding. Our work offers an initial study on synthesizing story data in low resource settings and underscores their potential for augmentation in data-constrained language modeling. We publicly release our models and implementation on our GitHub.

pdf bib abs

Causal Language Modeling (CLM) and Masked Language Modeling (MLM) are two mainstream learning paradigms based on Transformer networks, specifically the Decoder-only and Encoder-only architectures. The strengths of each paradigm in downstream tasks have shown a mix of advantages and disadvantages. In the past BabyLM Challenge 2023, although the MLM paradigm achieved the best average performance, the CLM paradigm demonstrated significantly faster convergence rates. For the BabyLM Challenge 2024, we propose a novel language modeling paradigm named AntLM, which integrates both CLM and MLM to leverage the advantages of these two classic paradigms. We chose the strict-small track and conducted experiments on two foundation models: BabyLlama, representing CLM, and LTG-BERT, representing MLM. During the training process for specific foundation models, we alternate between applying CLM or MLM training objectives and causal or bidirectional attention masks. Experimental results show that combining the two pretraining objectives leverages their strengths, enhancing overall training performance. Under the same epochs, AntLM_BabyLlama improves Macro-average by 1%, and AntLM_LTG-BERT achieves a 2.2% increase over the baselines.