Venue: Conference of the European Chapter of the Association for Computational Linguistics (EACL)
Year: 2011
Volume: 2011
Publisher: ACL Anthology
Abstract: N-gram language models are a major resource bottleneck in machine translation. In this paper, we present several language model implementations that are both highly compact and fast to query. Our fastest implementation is as fast as the widely used SRILM while requiring only 25% of the storage. Our most compact representation can store all 4 billion n-grams and associated counts for the Google n-gram corpus in 23 bits per n-gram, the most compact lossless representation to date, and even more compact than recent lossy compression techniques. We also discuss techniques for improving query speed during decoding, including a simple but novel language model caching technique that improves the query speed of our language models (and SRILM) by up to 300%.
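
As a rough illustration of the caching idea mentioned in the abstract (a minimal sketch, not the paper's implementation; the wrapped model, its `log_prob` method, and the LRU policy are assumptions for illustration), a decoder can memoize recent n-gram probability lookups in a small hash map, since decoding repeatedly queries the same n-grams:

```python
from collections import OrderedDict

class CachedLM:
    """Minimal sketch of an LRU lookup cache wrapped around an
    n-gram language model. The wrapped `lm` object and its
    `log_prob(ngram)` method are assumptions for illustration."""

    def __init__(self, lm, capacity=100_000):
        self.lm = lm
        self.capacity = capacity
        self.cache = OrderedDict()  # n-gram tuple -> log-probability

    def log_prob(self, ngram):
        ngram = tuple(ngram)
        if ngram in self.cache:
            # Cache hit: refresh recency and skip the expensive lookup.
            self.cache.move_to_end(ngram)
            return self.cache[ngram]
        # Cache miss: query the underlying model and remember the result.
        score = self.lm.log_prob(ngram)
        self.cache[ngram] = score
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict least recently used
        return score
```

Because a decoder scores many hypotheses that share context words, repeated lookups are common, which is why a simple cache like this can pay off; the specific speedups reported in the paper come from its own caching scheme, not this sketch.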