Basic article information

  • Title: A Content-based File Identification Dataset: collection, construction, and evaluation
  • Authors: Saja Dheyaa Khudhur; Hassan Awheed Jeiad
  • Journal: Karbala International Journal of Modern Science
  • Print ISSN: 2405-609X
  • Electronic ISSN: 2405-609X
  • Publication year: 2022
  • Volume: 8
  • Issue: 2
  • Pages: 188-195
  • DOI: 10.33640/2405-609X.3222
  • Language: English
  • Publisher: Elsevier
  • Abstract: File-Type Identification (FTI) is an essential function that is usually performed by examining the magic numbers of data blocks. This examination fails, however, when a file is corrupt or its magic numbers are missing. In that case, content-based analysis is the best way to identify the file type. This paper prepares and presents a content-based dataset covering eight common file types, described by twelve features. The dataset is designed for both supervised and unsupervised machine learning models. It supports classification and clustering at two levels: a fine-grained level (the exact file type: JPG, PNG, HTML, TXT, MP4, M4A, MOV, and MP3) and a coarse-grained level (the broad type: image, text, audio, video). The study assesses both the quality of the dataset and its features. The results show that the dataset is high-quality, non-biased, and complete, with an acceptable duplication ratio. In addition, several multi-class classifiers were trained on the data, reaching a classification accuracy of up to 81.8%. The main contribution of this work is a new, publicly available dataset built on statistical and information-content features, together with a detailed assessment and evaluation.
  • Keywords: Dataset; File-Type Identification (FTI); File Type classification; Fragment File Type Identification (FFTI); Machine Learning; Feature extraction; Dataset Evaluation; Content-based analysis
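
The abstract mentions statistical and information-content features but does not list the twelve it uses. As a rough illustration only, the sketch below computes three features commonly used in content-based file-type identification: Shannon entropy, mean byte value, and the share of printable ASCII bytes. The function name `byte_features` and the choice of features are assumptions made for this example, not the paper's feature set.

```python
import math
from collections import Counter


def byte_features(block: bytes) -> dict:
    """Compute a few illustrative content-based features for a data block.

    Hypothetical helper: these are NOT the twelve features of the paper.
    """
    if not block:
        raise ValueError("block must not be empty")
    n = len(block)
    counts = Counter(block)  # frequency of each byte value present in the block
    # Shannon entropy in bits per byte: 0 for constant data, 8 for uniform bytes.
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    # Arithmetic mean of the byte values (0..255).
    mean = sum(block) / n
    # Fraction of bytes in the printable ASCII range, a rough indicator of text.
    printable_ratio = sum(1 for b in block if 32 <= b <= 126) / n
    return {"entropy": entropy, "mean": mean, "printable_ratio": printable_ratio}
```

Compressed formats such as JPG or MP4 would typically show entropy close to 8 bits per byte, while TXT or HTML blocks tend to have lower entropy and a printable ratio near 1. That difference is what makes features of this kind usable when magic numbers are unavailable.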