SDOT: Secure Hash, Semantic Keyword Extraction, and Dynamic Operator Pattern-Based Three-Tier Forensic Classification Framework

Bibliographic Details
Main Authors: D. Paul Joseph, P. Viswanathan
Format: Article
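Language: English
Published: IEEE 2023-01-01
Series: IEEE Access
Subjects: Digital forensics; disc forensics; forensic data classification; metadata; pattern; blacklisted keywords
Online Access: https://ieeexplore.ieee.org/document/10006815/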
author D. Paul Joseph
P. Viswanathan
collection DOAJ
description Most traditional digital forensic techniques identify irrelevant files in a corpus using keyword search and frequent-hash, frequent-path, and frequent-size methods. These methods rely on Message Digest and Secure Hash Algorithm-1 (SHA-1) hashes, which are susceptible to hash collisions. Thresholds derived from frequent file sizes are imprecise, which increases the number of irrelevant files that must be evaluated. The blacklisted keywords used in forensic search are matched literally and non-lexically, which increases false-positive search results and fails to disambiguate unstructured text. As a result, many extraneous files are carried into further investigation, exacerbating the time lag. Moreover, the lack of standardized labeled forensic data results in $O(2^{n})$ time complexity during file classification. This research proposes a three-tier Keyword Metadata Pattern framework to address these concerns. First, a Secure Hash Algorithm-256 (SHA-256) hash is computed for the entire corpus, together with a custom regex and stop-words module, to overcome hash collisions and imprecise threshold values and to eliminate recurrent files. Next, blacklisted keywords are constructed by identifying vectorized words that lie in close proximity, which overcomes the drawbacks of traditional keyword search and reduces false positives. Dynamic, forensically relevant patterns based on massive password datasets are then designed to search for unique, relevant patterns, identifying the significant files and reducing the time lag. Based on the tier-2 results, files are preliminarily classified automatically in O(log n) complexity, and the system is trained with a machine learning model. Finally, in experimental evaluation, the proposed system outperformed the existing two-tier model at finding relevant files, with automated labeling and classification in O(n log n) complexity. The proposed model eliminated 223K irrelevant files and reduced the corpus by 4.1% in tier-1, identified 16.06% of sensitive files in tier-2, and classified files with 91% precision, 95% sensitivity, 91% accuracy, and 0.11% Hamming loss compared to the two-tier system.
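The tier-1 step in the description (hash every file with SHA-256, then eliminate recurrent files) could be sketched roughly as follows. This is an illustration only, not the authors' implementation: the function names, the chunk size, and the rule of keeping the first path in sorted order are all assumptions, and the custom regex and stop-words module mentioned in the description is not modeled here.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_by_digest(root):
    """Map each SHA-256 digest to the list of files under `root` with that content."""
    groups = defaultdict(list)
    # Sorting makes the choice of which copy counts as the "first" deterministic.
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            groups[sha256_of_file(path)].append(path)
    return dict(groups)


def recurrent_files(groups):
    """Return every file after the first in each digest group (assumed removal rule)."""
    return [path for paths in groups.values() for path in paths[1:]]
```

Calling `recurrent_files(group_by_digest(corpus_root))` on a hypothetical `corpus_root` directory yields the byte-identical repeats that a tier-1 pass of this kind would set aside before keyword and pattern analysis.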
first_indexed 2024-04-10T09:14:13Z
format Article
id doaj.art-53daa62d25aa42d38f73c63630dd4988
institution Directory Open Access Journal
issn 2169-3536
language English
last_indexed 2024-04-10T09:14:13Z
publishDate 2023-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj.art-53daa62d25aa42d38f73c63630dd4988; 2023-02-21T00:01:55Z; eng; IEEE; IEEE Access; 2169-3536; 2023-01-01; vol. 11, pp. 3291-3306; 10.1109/ACCESS.2023.3234434; 10006815
D. Paul Joseph (https://orcid.org/0000-0003-2897-212X), School of Information Technology and Engineering, Vellore Institute of Technology, Vellore, India
P. Viswanathan, School of Computer Science and Engineering, Vellore Institute of Technology, Vellore, India
title SDOT: Secure Hash, Semantic Keyword Extraction, and Dynamic Operator Pattern-Based Three-Tier Forensic Classification Framework
topic Digital forensics
disc forensics
forensic data classification
metadata
pattern
blacklisted keywords
url https://ieeexplore.ieee.org/document/10006815/