Basic article information

Title: A Content-based File Identification Dataset: collection, construction, and evaluation
Authors: Saja Dheyaa Khudhur; Hassan Awheed Jeiad
Journal: Karbala International Journal of Modern Science
Print ISSN: 2405-609X
Electronic ISSN: 2405-609X
Publication year: 2022
Volume: 8
Issue: 2
Pages: 188-195
DOI: 10.33640/2405-609X.3222
Language: English
Publisher: Elsevier
Abstract: File-Type Identification (FTI) is an essential function that can be performed by examining the magic numbers of data blocks. However, this examination becomes a challenge when a file is corrupt or its magic numbers are missing. Content-based analysis is the best way to identify file types when the magic numbers are not available. This paper prepares and presents a content-based dataset for eight common file types based on twelve features. We designed our dataset to be used with supervised and unsupervised machine learning models. It supports classifying and clustering these types at two levels: a fine-grained level (by exact file type: JPG, PNG, HTML, TXT, MP4, M4A, MOV, and MP3) and a coarse-grained level (by broad type: image, text, audio, video). Dataset quality and feature assessments are performed in this study. The obtained results show that our dataset is high-quality, unbiased, and complete, with an acceptable duplication ratio. In addition, several multi-class classifiers are trained on our data, and a classification accuracy of up to 81.8% is obtained. The main contribution of this work is the construction of a new publicly available dataset based on statistical and information-content-related features, with detailed assessment and evaluation.
Keywords: Dataset; File-Type Identification (FTI); File Type classification; Fragment File Type Identification (FFTI); Machine Learning; Feature extraction; Dataset Evaluation; content-based analysis