Basic Article Information

  • Title: The Web as a Parallel Corpus
  • Authors: Philip Resnik; Noah A. Smith
  • Journal: Computational Linguistics
  • Print ISSN: 0891-2017
  • Electronic ISSN: 1530-9312
  • Publication year: 2003
  • Volume: 29
  • Issue: 3
  • Pages: 349-380
  • DOI: 10.1162/089120103322711578
  • Language: English
  • Publisher: MIT Press
  • Abstract: Parallel corpora have become an essential resource for work in multilingual natural language processing. In this article, we report on our work using the STRAND system for mining parallel text on the World Wide Web, first reviewing the original algorithm and results and then presenting a set of significant enhancements. These enhancements include the use of supervised learning based on structural features of documents to improve classification performance, a new content-based measure of translational equivalence, and adaptation of the system to take advantage of the Internet Archive for mining parallel text from the Web on a large scale. Finally, the value of these techniques is demonstrated in the construction of a significant parallel corpus for a low-density language pair.