Venue: Conference of the European Chapter of the Association for Computational Linguistics (EACL)
Year: 2006
Volume: 2006
Publisher: ACL Anthology
Abstract: We consider the evaluation problem in Natural Language Generation (NLG) and present results for evaluating several NLG systems with similar functionality, including a knowledge-based generator and several statistical systems. We compare evaluation results for these systems by human domain experts, human non-experts, and several automatic evaluation metrics, including NIST, BLEU, and ROUGE. We find that NIST scores correlate best (> 0.8) with human judgments, but that all automatic metrics we examined are biased in favour of generators that select on the basis of frequency alone. We conclude that automatic evaluation of NLG systems has considerable potential, in particular where high-quality reference texts and only a small number of human evaluators are available. However, in general it is probably best for automatic evaluations to be supported by human-based evaluations, or at least by studies that demonstrate that a particular metric correlates well with human judgments in a given domain.
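
The kind of metric-versus-human comparison described in the abstract can be illustrated with a small sketch. The texts, human ratings, and smoothing choices below are invented for illustration and are not the paper's data or method; the sketch only shows how BLEU and NIST scores for candidate outputs might be correlated against human quality judgments, assuming NLTK and SciPy are available.

    # Illustrative sketch only: toy texts and human ratings are invented,
    # not taken from the paper. Assumes NLTK and SciPy are installed.
    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
    from nltk.translate.nist_score import sentence_nist
    from scipy.stats import pearsonr

    # One human-written reference text and several candidate system outputs, tokenized.
    reference = "the weather will be cold with some rain later".split()
    outputs = [
        "cold weather with some rain later".split(),
        "it will be cold and rainy".split(),
        "rain cold later some".split(),
    ]

    # Hypothetical human quality ratings for the same outputs (e.g. expert scores).
    human_scores = [4.5, 3.8, 1.2]

    # Automatic metric scores for each output against the reference.
    smooth = SmoothingFunction().method1
    bleu = [sentence_bleu([reference], out, smoothing_function=smooth) for out in outputs]
    nist = [sentence_nist([reference], out, n=2) for out in outputs]

    # Pearson correlation between each automatic metric and the human ratings,
    # analogous in spirit to the correlations (> 0.8 for NIST) reported above.
    print("BLEU vs human:", pearsonr(bleu, human_scores)[0])
    print("NIST vs human:", pearsonr(nist, human_scores)[0])

In practice such correlations would be computed over many system outputs and averaged human ratings per output, rather than the three toy items shown here.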