Basic article information

  • Title: Soft Similarity and Soft Cosine Measure: Similarity of Features in Vector Space Model
  • Authors: Grigori Sidorov; Alexander Gelbukh; Helena Gómez-Adorno
  • Journal: Computación y Sistemas
  • Print ISSN: 1405-5546
  • Year of publication: 2014
  • Volume: 18
  • Issue: 3
  • Pages: 491-504
  • Language: English
  • Publisher: Instituto Politécnico Nacional
  • Abstract: We show how to consider similarity between features for calculation of similarity of objects in the Vector Space Model (VSM) for machine learning algorithms and other classes of methods that involve similarity between objects. Unlike LSA, we assume that similarity between features is known (say, from a synonym dictionary) and does not need to be learned from the data. We call the proposed similarity measure soft similarity. Similarity between features is common, for example, in natural language processing: words, n-grams, or syntactic n-grams can be somewhat different (which makes them different features) but still have much in common: for example, words “play” and “game” are different but related. When there is no similarity between features then our soft similarity measure is equal to the standard similarity. For this, we generalize the well-known cosine similarity measure in VSM by introducing what we call “soft cosine measure”. We propose various formulas for exact or approximate calculation of the soft cosine measure. For example, in one of them we consider for VSM a new feature space consisting of pairs of the original features weighted by their similarity. Again, for features that bear no similarity to each other, our formulas reduce to the standard cosine measure. Our experiments show that our soft cosine measure provides better performance in our case study: entrance exams question answering task at CLEF. In these experiments, we use syntactic n-grams as features and Levenshtein distance as the similarity between n-grams, measured either in characters or in elements of n-grams.
  • Keywords: Soft similarity; soft cosine measure; vector space model; similarity between features; Levenshtein distance; n-grams; syntactic n-grams
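
The soft cosine measure described in the abstract can be illustrated with a minimal Python sketch. It replaces the plain dot product with one weighted by a feature-similarity matrix, so that with the identity matrix (no similarity between distinct features) it reduces to the standard cosine. The function names (`levenshtein`, `feature_similarity`, `soft_dot`, `soft_cosine`) are hypothetical, and the way edit distance is turned into a similarity in [0, 1] is an assumption made for illustration; this record does not specify the exact conversion the authors used.

```python
import math


def levenshtein(a, b):
    """Edit distance between two sequences (strings or tuples of n-gram elements)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def feature_similarity(f, g):
    """Assumed conversion: 1 - distance / length of the longer feature."""
    longest = max(len(f), len(g))
    return 1.0 if longest == 0 else 1.0 - levenshtein(f, g) / longest


def soft_dot(a, b, sim):
    """Sum over all feature pairs (i, j) of sim[i][j] * a[i] * b[j]."""
    n = len(a)
    return sum(sim[i][j] * a[i] * b[j] for i in range(n) for j in range(n))


def soft_cosine(a, b, sim):
    """Cosine in which both the numerator and the norms use the soft dot product."""
    denom = math.sqrt(soft_dot(a, a, sim)) * math.sqrt(soft_dot(b, b, sim))
    return soft_dot(a, b, sim) / denom if denom else 0.0


# Usage: build the similarity matrix once from the feature list.
features = ["play", "game", "player"]
sim = [[feature_similarity(f, g) for g in features] for f in features]
doc_a = [1.0, 0.0, 0.0]  # contains only "play"
doc_b = [0.0, 0.0, 1.0]  # contains only "player"
score = soft_cosine(doc_a, doc_b, sim)  # nonzero, unlike the standard cosine
```

Because `levenshtein` accepts any sequence, passing strings measures the distance in characters, while passing tuples of n-gram elements measures it in elements, which corresponds to the two variants mentioned in the abstract.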