Article Information

  • Title: Parallel Data Extraction using Word Embeddings
  • Authors: Pintu Lohar; Andy Way
  • Journal: Computer Science & Information Technology
  • Electronic ISSN: 2231-5403
  • Publication year: 2020
  • Volume: 10
  • Issue: 15
  • Pages: 251-267
  • DOI: 10.5121/csit.2020.101521
  • Publisher: Academy & Industry Research Collaboration Center (AIRCC)
  • Abstract: Building a robust MT system requires a sufficiently large parallel corpus to be available as training data. In this paper, we propose to automatically extract parallel sentences from comparable corpora without using any MT system or even any parallel corpus at all. Instead, we use cross-lingual information retrieval (CLIR), average word embeddings, text similarity and a bilingual dictionary, thus saving a significant amount of time and effort, as no MT system is involved in this process. We conduct experiments on two different kinds of data: (i) formal texts from the news domain, and (ii) user-generated content (UGC) from hotel reviews. The automatically extracted sentence pairs are then added to the already available parallel training data, and the extended translation models are built from the concatenated data sets. Finally, we compare the performance of our new extended models against the baseline models built from the available data. The experimental evaluation reveals that our proposed approach is capable of improving the translation outputs for both the formal texts and UGC.
  • Keywords: Machine Translation; parallel data; user-generated content; word embeddings; text similarity; comparable corpora
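
The abstract names average word embeddings, text similarity and a bilingual dictionary as ingredients but does not spell out how they are combined. The following is only a minimal sketch of one plausible reading, not the authors' implementation: source tokens are mapped through a dictionary, each sentence is represented by the mean of its word vectors, and candidate pairs are ranked by cosine similarity. The toy vectors, the German-English dictionary, the 0.8 threshold and all function names are assumptions made for illustration.

```python
import math

# Toy 3-dimensional embeddings; every value here is invented for illustration.
EN_VECTORS = {
    "hotel": [0.9, 0.1, 0.0],
    "clean": [0.1, 0.8, 0.1],
    "room": [0.7, 0.2, 0.1],
    "train": [0.0, 0.1, 0.9],
}

# Hypothetical bilingual dictionary (German -> English); an assumption.
DE_EN = {"hotel": "hotel", "sauber": "clean", "zimmer": "room", "zug": "train"}


def translate_tokens(tokens, dictionary):
    """Map source tokens to target words, dropping tokens not in the dictionary."""
    return [dictionary[t] for t in tokens if t in dictionary]


def average_embedding(tokens, vectors):
    """Mean of the vectors of all tokens that have an embedding, or None."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def best_match(source_tokens, candidates, dictionary, vectors, threshold=0.8):
    """Return the candidate sentence most similar to the source, or None
    if no candidate reaches the (assumed) similarity threshold."""
    src = average_embedding(translate_tokens(source_tokens, dictionary), vectors)
    if src is None:
        return None
    best, best_score = None, threshold
    for cand in candidates:
        vec = average_embedding(cand, vectors)
        if vec is None:
            continue
        score = cosine(src, vec)
        if score >= best_score:
            best, best_score = cand, score
    return best


if __name__ == "__main__":
    candidates = [["train"], ["clean", "room"]]
    print(best_match(["sauber", "zimmer"], candidates, DE_EN, EN_VECTORS))
```

In a real setting the toy table would be replaced by pretrained embeddings, and a retrieval step such as the CLIR stage mentioned in the abstract would first narrow the candidate set so that not every sentence pair has to be scored.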