Basic Article Information

  • Title: An Improved Framework for Content-based Spamdexing Detection
  • Authors: Asim Shahzad; Hairulnizam Mahdin; Nazri Mohd Nawi
  • Journal: International Journal of Advanced Computer Science and Applications (IJACSA)
  • Print ISSN: 2158-107X
  • Electronic ISSN: 2156-5570
  • Publication year: 2020
  • Volume: 11
  • Issue: 1
  • Pages: 409-420
  • Publisher: Science and Information Society (SAI)
  • Abstract: Spamdexing is one of the biggest threats facing modern Search Engines (SEs). Spammers now use a wide range of content-generation techniques, filling the Search Engine Result Pages (SERPs) with low-quality web pages through content spam. Spam web pages generally give users insufficient, irrelevant, and improper results. Many researchers in academia and industry are working on spamdexing to identify spam web pages, but so far no single universally efficient method has been developed for identifying all of them. We believe that tackling content spam requires improved methods. This article is an attempt in that direction: it proposes a framework for identifying spam web pages. The framework uses Stop Words, Keywords Density, Spam Keywords Database, Part of Speech (POS) ratio, and Copied Content algorithms. The WEBSPAM-UK2006 and WEBSPAM-UK2007 datasets were used to conduct the experiments and obtain threshold values. An F-measure of 77.38% illustrates the effectiveness and applicability of the proposed method.
  • Keywords: Information retrieval; Web spam detection; content spam; POS ratio; search spam; Keywords stuffing; machine generated content detection