Journal: Journal of Theoretical and Applied Information Technology
Print ISSN: 1992-8645
Electronic ISSN: 1817-3195
Publication year: 2019
Volume: 97
Issue: 12
Pages: 3436-3447
Publisher: Journal of Theoretical and Applied
Abstract: The Pearson correlation is a performance measure that indicates the extent to which two variables are linearly related. When Pearson is applied to the semantic similarity domain, it shows the degree of correlation between two sets of scores for a dataset's test pairs: the human-assigned and the observed similarity scores. However, the Pearson correlation is sensitive to outliers in benchmark datasets. Although many works have tackled the outlier problem, little research has focused on the internal distribution of the benchmark dataset's bins. A representative and well-distributed text benchmark dataset embodies a wide range of similarity score values; therefore, the benchmark dataset could be considered a cross-sectional dataset. Although a perfect text similarity method could report a high Pearson correlation, the standard Pearson correlation is unaware of correlated individual text pairs in a single dataset's cross-section due to outliers. Therefore, this paper proposes the normalized mean scaled square error method, derived from the standard scaled error, to eliminate the outliers. The newly proposed metric was applied to five benchmark datasets. Results showed that the metric is interpretable, robust to outliers, and competitive with other related metrics.
Keywords: Pearson; Absolute Error; Text Similarity; Correlation; Scaled Square Error; Outliers
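
The abstract's claim that the Pearson correlation is sensitive to outliers can be illustrated with a minimal sketch. The scores below are invented for illustration and are not taken from the paper's benchmark datasets; the helper name `pearson` is likewise an assumption. The paper's proposed normalized mean scaled square error is not reproduced here, since the abstract does not define it.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

# Hypothetical human-assigned similarity scores for eight text pairs.
human = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]

# Hypothetical scores from a similarity method that tracks the human scores closely.
observed = [0.6, 0.9, 1.6, 2.1, 2.4, 3.1, 3.4, 4.1]

# The same scores, with the last pair replaced by a single badly wrong score.
observed_with_outlier = observed[:-1] + [0.2]

r_clean = pearson(human, observed)
r_outlier = pearson(human, observed_with_outlier)

print(f"Pearson without outlier: {r_clean:.3f}")
print(f"Pearson with one outlier: {r_outlier:.3f}")
```

In this toy setting, seven of the eight pairs are scored identically in both runs, yet a single outlying pair lowers the coefficient substantially, which is the kind of behavior the abstract describes as motivating an outlier-robust metric.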