期刊名称:Conference on European Chapter of the Association for Computational Linguistics (EACL)
出版年度:2011
卷号:2011
出版社:ACL Anthology
摘要:We introduce a new publicly available tool
that implements efficient indexing and retrieval
of large N-gram datasets, such as the
Web1T 5-gram corpus. Our tool indexes the
entire Web1T dataset with an index size of
only 100 MB and performs a retrieval of any
N-gram with a single disk access. With an
increased index size of 420 MB and duplicate
data, it also allows users to issue wild
card queries provided that the wild cards in the
query are contiguous. Furthermore, we also
implement some of the smoothing algorithms
that are designed specifically for large datasets
and are shown to yield better language models
than the traditional ones on the Web1T 5-
gram corpus (Yuret, 2008). We demonstrate
the effectiveness of our tool and the smoothing
algorithms on the English Lexical Substitution
task by a simple implementation that
gives considerable improvement over a basic
language model.