Basic article information

  • Title: Optimizing Tokenization Choice for Machine Translation across Multiple Target Languages
  • Authors: Nasser Zalmout; Nizar Habash
  • Journal: The Prague Bulletin of Mathematical Linguistics
  • Print ISSN: 0032-6585
  • Electronic ISSN: 1804-0462
  • Publication year: 2017
  • Volume: 108
  • Issue: 1
  • Pages: 257-269
  • DOI: 10.1515/pralin-2017-0025
  • Language: English
  • Publisher: Walter de Gruyter GmbH
  • Abstract: Tokenization is very helpful for Statistical Machine Translation (SMT), especially when translating from morphologically rich languages. Typically, a single tokenization scheme is applied to the entire source-language text and regardless of the target language. In this paper, we evaluate the hypothesis that SMT performance may benefit from different tokenization schemes for different words within the same text, and also for different target languages. We apply this approach to Arabic as a source language, with five target languages of varying morphological complexity: English, French, Spanish, Russian and Chinese. Our results show that different target languages indeed require different source-language schemes; and a context-variable tokenization scheme can outperform a context-constant scheme with a statistically significant performance enhancement of about 1.4 BLEU points.