Article Information

  • Title: Web Text Corpus for Natural Language Processing
  • Authors: Vinci Liu; James R. Curran
  • Venue: Conference on European Chapter of the Association for Computational Linguistics (EACL)
  • Publication year: 2006
  • Volume: 2006
  • Publisher: ACL Anthology
  • Abstract: Web text has been successfully used as training data for many NLP applications. While most previous work accesses web text through search engine hit counts, we created a Web Corpus by downloading web pages to create a topic-diverse collection of 10 billion words of English. We show that for context-sensitive spelling correction the Web Corpus results are better than using a search engine. For thesaurus extraction, it achieved similar overall results to a corpus of newspaper text. With many more words available on the web, better results can be obtained by collecting much larger web corpora.