Amir Kamran
2020
ParaCrawl: Web-Scale Acquisition of Parallel Corpora
Marta Bañón | Pinzhen Chen | Barry Haddow | Kenneth Heafield | Hieu Hoang | Miquel Esplà-Gomis | Mikel L. Forcada | Amir Kamran | Faheem Kirefu | Philipp Koehn | Sergio Ortiz Rojas | Leopoldo Pla Sempere | Gema Ramírez-Sánchez | Elsa Sarrías | Marek Strelec | Brian Thompson | William Waites | Dion Wiggins | Jaume Zaragoza
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
We report on methods to create the largest publicly available parallel corpora by crawling the web, using open source software. We empirically compare alternative methods and publish benchmark data sets for sentence alignment and sentence pair filtering. We also describe the parallel corpora released and evaluate their quality and their usefulness for creating machine translation systems.
CEF Data Marketplace: Powering a Long-term Supply of Language Data
Amir Kamran | Dace Dzeguze | Jaap van der Meer | Milica Panic | Alessandro Cattelan | Daniele Patrioli | Luisa Bentivogli | Marco Turchi
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation
We describe the CEF Data Marketplace project, which focuses on the development of a trading platform for translation data, aimed at language professionals: translators, machine translation (MT) developers, language service providers (LSPs), translation buyers and government bodies. The CEF Data Marketplace platform will be designed and built to manage and trade data for all languages and domains. This project will open a continuous and long-term supply of language data for MT and other machine learning applications.
2017
Results of the WMT17 Metrics Shared Task
Ondřej Bojar | Yvette Graham | Amir Kamran
Proceedings of the Second Conference on Machine Translation
2016
Results of the WMT16 Metrics Shared Task
Ondřej Bojar | Yvette Graham | Amir Kamran | Miloš Stanojević
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers
Results of the WMT16 Tuning Shared Task
Bushra Jawaid | Amir Kamran | Miloš Stanojević | Ondřej Bojar
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers
Enriching Source for English-to-Urdu Machine Translation
Bushra Jawaid | Amir Kamran | Ondřej Bojar
Proceedings of the 6th Workshop on South and Southeast Asian Natural Language Processing (WSSANLP2016)
This paper focuses on the generation of case markers for free word order languages that use case markers as phrasal clitics for marking the relationship between the dependent noun and its head. The generation of such clitics becomes an essential task, especially when translating from fixed word order languages where syntactic relations are identified by the positions of the dependent nouns. To address the problem of missing markers on the source side, artificial markers are added to the source to improve alignment with its target counterparts. An increase of up to 1 BLEU point is observed over the baseline on different test sets for English-to-Urdu.
2015
Results of the WMT15 Metrics Shared Task
Miloš Stanojević | Amir Kamran | Philipp Koehn | Ondřej Bojar
Proceedings of the Tenth Workshop on Statistical Machine Translation
Results of the WMT15 Tuning Shared Task
Miloš Stanojević | Amir Kamran | Ondřej Bojar
Proceedings of the Tenth Workshop on Statistical Machine Translation
2014
A Tagged Corpus and a Tagger for Urdu
Bushra Jawaid | Amir Kamran | Ondřej Bojar
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
In this paper, we describe the release of a sizeable monolingual Urdu corpus automatically tagged with part-of-speech tags. We extend the work of Jawaid and Bojar (2012), who use three different taggers and then apply a voting scheme to disambiguate among the different choices suggested by each tagger. We run this complex ensemble on a large monolingual corpus and release the tagged corpus. Additionally, we use this data to train a single standalone tagger, which we hope will significantly simplify Urdu processing. The standalone tagger achieves an accuracy of 88.74% on test data.
English to Urdu Statistical Machine Translation: Establishing a Baseline
Bushra Jawaid | Amir Kamran | Ondřej Bojar
Proceedings of the Fifth Workshop on South and Southeast Asian Natural Language Processing
2012
Co-authors
- Ondřej Bojar 10
- Bushra Jawaid 5
- Miloš Stanojević 5
- Yvette Graham 2
- Philipp Koehn 2
- Marta Bañón 1
- Luisa Bentivogli 1
- Alessandro Cattelan 1
- Pinzhen Chen 1
- Dace Dzeguze 1
- Miquel Esplà-Gomis 1
- Mikel L. Forcada 1
- Petra Galuščáková 1
- Barry Haddow 1
- Kenneth Heafield 1
- Hieu Hoang 1
- Faheem Kirefu 1
- Sergio Ortiz Rojas 1
- Milica Panic 1
- Daniele Patrioli 1
- Gema Ramírez-Sánchez 1
- Elsa Sarrías 1
- Leopoldo Pla Sempere 1
- Marek Strelec 1
- Aleš Tamchyna 1
- Brian Thompson 1
- Marco Turchi 1
- William Waites 1
- Dion Wiggins 1
- Jaume Zaragoza 1
- Jaap van der Meer 1