SDOT: Secure Hash, Semantic Keyword Extraction, and Dynamic Operator Pattern-Based Three-Tier Forensic Classification Framework

Most traditional digital forensic techniques identify irrelevant files in a corpus using keyword search, frequent hashes, frequent paths, and frequent sizes. These methods rely on Message Digest and Secure Hash Algorithm-1, which are susceptible to hash collisions. The threshold criteria of file...


Bibliographic Details
Main Authors:	D. Paul Joseph, P. Viswanathan
Format:	Article
Language:	English
Published:	IEEE 2023-01-01
Series:	IEEE Access
Subjects:	Digital forensics; disc forensics; forensic data classification; metadata; pattern; blacklisted keywords
Online Access:	https://ieeexplore.ieee.org/document/10006815/

_version_	1797902217333702656
author	D. Paul Joseph; P. Viswanathan
author_facet	D. Paul Joseph; P. Viswanathan
author_sort	D. Paul Joseph
collection	DOAJ
description	Most traditional digital forensic techniques identify irrelevant files in a corpus using keyword search, frequent hashes, frequent paths, and frequent sizes. These methods rely on Message Digest and Secure Hash Algorithm-1, which are susceptible to hash collisions. Thresholds based on frequent file sizes yield imprecise threshold values, which increases the number of irrelevant files evaluated. The blacklisted keywords used in forensic search are literal and non-lexical, which increases false-positive search results and fails to disambiguate unstructured text. As a result, many extraneous files are carried into further investigation, exacerbating the time lag. Moreover, the non-availability of standardized labeled forensic data results in $O(2^{n})$ time complexity during file classification. This research proposes a three-tier Keyword Metadata Pattern framework to address these concerns. First, a Secure Hash Algorithm-256 hash is constructed for the entire corpus, along with a custom regex and stop-words module, to overcome hash collisions and imprecise threshold values and to eliminate recurrent files. Then blacklisted keywords are constructed by identifying vectorized words in close proximity, to overcome the drawbacks of traditional keyword search and reduce false-positive results. Dynamic, forensically relevant patterns based on massive password datasets are designed to search for unique, relevant patterns, identify the significant files, and overcome the time lag. Based on tier-2 results, files are preliminarily classified automatically in O(log n) complexity, and the system is trained with a machine learning model. Finally, in experimental evaluation, the proposed system outperformed the existing two-tier model in finding relevant files, with automated labeling and classification in O(n log n) complexity. The proposed model eliminated 223K irrelevant files and reduced the corpus by 4.1% in tier-1, identified 16.06% of sensitive files in tier-2, and classified files with 91% precision, 95% sensitivity, 91% accuracy, and 0.11% Hamming loss compared to the two-tier system.
first_indexed	2024-04-10T09:14:13Z
format	Article
id	doaj.art-53daa62d25aa42d38f73c63630dd4988
institution	Directory of Open Access Journals
issn	2169-3536
language	English
last_indexed	2024-04-10T09:14:13Z
publishDate	2023-01-01
publisher	IEEE
record_format	Article
series	IEEE Access
spelling	doaj.art-53daa62d25aa42d38f73c63630dd4988 | 2023-02-21T00:01:55Z | eng | IEEE | IEEE Access | 2169-3536 | 2023-01-01 | 11 | 3291 | 3306 | 10.1109/ACCESS.2023.3234434 | 10006815 | SDOT: Secure Hash, Semantic Keyword Extraction, and Dynamic Operator Pattern-Based Three-Tier Forensic Classification Framework | D. Paul Joseph | 0 | https://orcid.org/0000-0003-2897-212X | P. Viswanathan | 1 | School of Information Technology and Engineering, Vellore Institute of Technology, Vellore, India | School of Computer Science and Engineering, Vellore Institute of Technology, Vellore, India | https://ieeexplore.ieee.org/document/10006815/ | Digital forensics | disc forensics | forensic data classification | metadata | pattern | blacklisted keywords
spellingShingle	D. Paul Joseph; P. Viswanathan | SDOT: Secure Hash, Semantic Keyword Extraction, and Dynamic Operator Pattern-Based Three-Tier Forensic Classification Framework | IEEE Access | Digital forensics; disc forensics; forensic data classification; metadata; pattern; blacklisted keywords
title	SDOT: Secure Hash, Semantic Keyword Extraction, and Dynamic Operator Pattern-Based Three-Tier Forensic Classification Framework
title_sort	sdot secure hash semantic keyword extraction and dynamic operator pattern based three tier forensic classification framework
topic	Digital forensics; disc forensics; forensic data classification; metadata; pattern; blacklisted keywords
url	https://ieeexplore.ieee.org/document/10006815/
work_keys_str_mv	AT dpauljoseph sdotsecurehashsemantickeywordextractionanddynamicoperatorpatternbasedthreetierforensicclassificationframework AT pviswanathan sdotsecurehashsemantickeywordextractionanddynamicoperatorpatternbasedthreetierforensicclassificationframework
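
The tier-1 step in the description (hashing the whole corpus with Secure Hash Algorithm-256 and eliminating recurrent files) might look roughly like the sketch below. This record does not give the authors' implementation, so the function names `sha256_of_file` and `drop_recurrent_files` are hypothetical, and the custom regex and stop-words module mentioned in the abstract is not modeled here.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large evidence files are never loaded whole.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def drop_recurrent_files(root):
    """Keep one path per distinct SHA-256 digest under `root`.

    Hypothetical helper: returns (unique_paths, duplicate_paths).
    """
    seen = {}        # digest -> first path found with that content
    duplicates = []  # later paths whose content was already seen
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        digest = sha256_of_file(path)
        if digest in seen:
            duplicates.append(path)
        else:
            seen[digest] = path
    return list(seen.values()), duplicates
```

Calling `drop_recurrent_files("/path/to/corpus")` returns the files to keep and the recurrent copies that could be excluded from later tiers. Paths are visited in sorted order so that which copy counts as the original is deterministic.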


Similar Items