期刊名称:The Prague Bulletin of Mathematical Linguistics
印刷版ISSN:0032-6585
电子版ISSN:1804-0462
出版年度:2010
卷号:94
期号:1
页码:67-76
DOI:10.2478/v10108-010-0021-7
语种:English
出版社:Walter de Gruyter GmbH
摘要:This work introduces a method and tool for handling overlapping parallel corpora — i.e. corpora that are based on the same source material. The method is insensitive to minor changes in the text, different segmentation levels of the corpora and omitted material from either corpora. The aim is to detect matching sentence pairs and either produce combinations of the overlapping corpora or compare them and assess their quality in comparison to each other. The introduced tool enables the user to define the desired behavior when combining corpora pairs, resulting in pure comparison, maximum-size or maximum-quality versions of the combinations. We test the tool on two cases of overlapping parallel corpora and five language pairs. We also evaluate the impact of using the method on two translation systems — a phrase-based and a parsing-based one.