
Article Information

  • Title: Paraphrasing Training Data for Statistical Machine Translation
  • Authors: Eric Nichols; Francis Bond; D. Scott Appling
  • Journal: Information and Media Technologies
  • Electronic ISSN: 1881-0896
  • Publication year: 2010
  • Volume: 5
  • Issue: 2
  • Pages: 950-971
  • DOI: 10.11185/imt.5.950
  • Publisher: Information and Media Technologies Editorial Board
  • Abstract: Large amounts of data are essential for training statistical machine translation systems. In this paper we show how training data can be expanded by paraphrasing one side of a parallel corpus. The new data is made by parsing then generating using an open-source, precise HPSG-based grammar. This gives sentences with the same meaning, but with minor variations in lexical choice and word order. In experiments paraphrasing the English in the Tanaka Corpus, a freely-available Japanese-English parallel corpus, we show consistent, statistically-significant gains on training data sets ranging from 10,000 to 147,000 sentence pairs in size as evaluated by the BLEU and METEOR automatic evaluation metrics.
  • Keywords: Natural Language Processing; Machine Translation; Paraphrasing; HPSG