2024
pdf
bib
abs
MARVEL: Unlocking the Multi-Modal Capability of Dense Retrieval via Visual Module Plugin
Tianshuo Zhou
|
Sen Mei
|
Xinze Li
|
Zhenghao Liu
|
Chenyan Xiong
|
Zhiyuan Liu
|
Yu Gu
|
Ge Yu
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
This paper proposes Multi-modAl Retrieval model via Visual modulE pLugin (MARVEL), which learns an embedding space for queries and multi-modal documents to conduct retrieval. MARVEL encodes queries and multi-modal documents with a unified encoder model, which helps to alleviate the modality gap between images and texts. Specifically, we enable the image understanding ability of the well-trained dense retriever, T5-ANCE, by incorporating the visual module’s encoded image features as its inputs. To facilitate the multi-modal retrieval tasks, we build the ClueWeb22-MM dataset based on the ClueWeb22 dataset, which regards anchor texts as queries, and extracts the related text and image documents from anchor-linked web pages. Our experiments show that MARVEL significantly outperforms the state-of-the-art methods on the multi-modal retrieval dataset WebQA and ClueWeb22-MM. MARVEL provides an opportunity to broaden the advantages of text retrieval to the multi-modal scenario. Besides, we also illustrate that the language model has the ability to extract image semantics and partly map the image features to the input word embedding space. All codes are available at https://github.com/OpenMatch/MARVEL.
pdf
bib
abs
Cleaner Pretraining Corpus Curation with Neural Web Scraping
Zhipeng Xu
|
Zhenghao Liu
|
Yukun Yan
|
Zhiyuan Liu
|
Ge Yu
|
Chenyan Xiong
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
The web contains large-scale, diverse, and abundant information to satisfy the information-seeking needs of humans. Through meticulous data collection, preprocessing, and curation, webpages can be used as a fundamental data resource for language model pretraining. However, when confronted with the progressively revolutionized and intricate nature of webpages, rule-based/feature-based web scrapers are becoming increasingly inadequate. This paper presents a simple, fast, and effective Neural web Scraper (NeuScraper) to help extract primary and clean text contents from webpages. Experimental results show that NeuScraper surpasses the baseline scrapers by achieving more than a 20% improvement, demonstrating its potential in extracting higher-quality data to facilitate the language model pretraining. All of the code is available at https://github.com/OpenMatch/NeuScraper.
pdf
bib
abs
INTERVENOR: Prompting the Coding Ability of Large Language Models with the Interactive Chain of Repair
Hanbin Wang
|
Zhenghao Liu
|
Shuo Wang
|
Ganqu Cui
|
Ning Ding
|
Zhiyuan Liu
|
Ge Yu
Findings of the Association for Computational Linguistics: ACL 2024
This paper introduces INTERVENOR (INTERactiVE chaiN Of Repair), a system designed to emulate the interactive code repair processes observed in humans, encompassing both code diagnosis and code repair. INTERVENOR prompts Large Language Models (LLMs) to play distinct roles during the code repair process, functioning as both a Code Learner and a Code Teacher. Specifically, the Code Learner is tasked with adhering to instructions to generate or repair code, while the Code Teacher is responsible for crafting a Chain-of-Repair (CoR) to serve as guidance for the Code Learner. During generating the CoR, the Code Teacher needs to check the generated codes from Code Learner and reassess how to address code bugs based on error feedback received from compilers. Experimental results demonstrate that INTERVENOR surpasses baseline models, exhibiting improvements of approximately 18% and 4.3% over GPT-3.5 in code generation and code translation tasks, respectively. Our further analyses show that CoR is effective to illuminate the reasons behind bugs and outline solution plans in natural language. With the feedback of code compilers, INTERVENOR can accurately identify syntax errors and assertion errors and provide precise instructions to repair codes. All data and codes are available at [https://github.com/NEUIR/INTERVENOR](https://github.com/NEUIR/INTERVENOR).
2023
pdf
bib
abs
Structure-Aware Language Model Pretraining Improves Dense Retrieval on Structured Data
Xinze Li
|
Zhenghao Liu
|
Chenyan Xiong
|
Shi Yu
|
Yu Gu
|
Zhiyuan Liu
|
Ge Yu
Findings of the Association for Computational Linguistics: ACL 2023
This paper presents Structure Aware Dense Retrieval (SANTA) model, which encodes user queries and structured data in one universal embedding space for retrieving structured data. SANTA proposes two pretraining methods to make language models structure-aware and learn effective representations for structured data: 1) Structured Data Alignment, which utilizes the natural alignment relations between structured data and unstructured data for structure-aware pretraining. It contrastively trains language models to represent multi-modal text data and teaches models to distinguish matched structured data for unstructured texts. 2) Masked Entity Prediction, which designs an entity-oriented mask strategy and asks language models to fill in the masked entities. Our experiments show that SANTA achieves state-of-the-art on code search and product search and conducts convincing results in the zero-shot setting. SANTA learns tailored representations for multi-modal text data by aligning structured and unstructured data pairs and capturing structural semantics by masking and predicting entities in the structured data. All codes are available at
https://github.com/OpenMatch/OpenMatch.
2013
pdf
bib
Is Twitter A Better Corpus for Measuring Sentiment Similarity?
Shi Feng
|
Le Zhang
|
Binyang Li
|
Daling Wang
|
Ge Yu
|
Kam-Fai Wong
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing