Search Query Language Identification Using Weak Labeling

Ritiz Tambi, Ajinkya Kale, Tracy Holloway King


Abstract
Language identification is a well-known task for natural language documents. In this paper, we explore search query language identification, which is usually the first step before any other query understanding task. Without loss of generality, we run our experiments on the Adobe Stock search engine. Even though the domain is relatively generic, because Adobe Stock queries cover a broad range of objects and concepts, out-of-the-box language identifiers do not perform well due to the extremely short text found in queries. Unlike well-studied supervised approaches to this task, we examine a practical approach to the cold-start problem: automatically obtaining large-scale query-language pairs for training. We describe the process of creating weakly labeled training data and then human-annotated evaluation data for the search query language identification task. We demonstrate the effectiveness of this technique by training a gradient boosting model that classifies the language of a given query. We outperform the open-domain text model baselines by a large margin.
Anthology ID:
2020.lrec-1.432
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
3520–3527
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.432
Cite (ACL):
Ritiz Tambi, Ajinkya Kale, and Tracy Holloway King. 2020. Search Query Language Identification Using Weak Labeling. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 3520–3527, Marseille, France. European Language Resources Association.
Cite (Informal):
Search Query Language Identification Using Weak Labeling (Tambi et al., LREC 2020)
PDF:
https://aclanthology.org/2020.lrec-1.432.pdf