SDOT: Secure Hash, Semantic Keyword Extraction, and Dynamic Operator Pattern-Based Three-Tier Forensic Classification Framework
Most traditional digital forensic techniques identify irrelevant files in a corpus using keyword search, frequent-hash, frequent-path, and frequent-size methods. These methods are based on Message Digest and Secure Hash Algorithm-1, which are prone to hash collisions. The threshold criteria of file... An illustrative sketch of the tier-1 hashing step follows the record summary below.
Main Authors: | D. Paul Joseph, P. Viswanathan |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE, 2023-01-01 |
Series: | IEEE Access |
Subjects: | Digital forensics, disc forensics, forensic data classification, metadata, pattern, blacklisted keywords |
Online Access: | https://ieeexplore.ieee.org/document/10006815/ |
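The brief description above points out that deduplication built on Message Digest and SHA-1 is prone to hash collisions, and that SDOT's tier-1 instead hashes the whole corpus with SHA-256 alongside a custom regex and stop-word module to eliminate recurrent files. The record contains no code, so the following is only a minimal Python sketch of that idea; the corpus directory, the skip pattern, and the helper names are assumptions for illustration, not taken from the article.

```python
import hashlib
import re
from pathlib import Path

# Illustrative regex for files a tier-1 pass might skip outright
# (system, temporary, and cache artifacts). The actual patterns used
# in the SDOT framework are not published in this record.
SKIP_PATTERN = re.compile(r"\.(tmp|log|bak|dll|sys)$", re.IGNORECASE)

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large evidence files fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def tier1_deduplicate(corpus_dir: str) -> list[Path]:
    """Keep one representative per unique SHA-256 digest, skipping files
    matched by the (hypothetical) skip pattern."""
    seen: dict[str, Path] = {}
    for path in Path(corpus_dir).rglob("*"):
        if not path.is_file() or SKIP_PATTERN.search(path.name):
            continue
        digest = sha256_of(path)
        seen.setdefault(digest, path)  # first occurrence wins; duplicates dropped
    return list(seen.values())

if __name__ == "__main__":
    unique_files = tier1_deduplicate("./corpus")  # hypothetical corpus directory
    print(f"{len(unique_files)} unique files retained after tier-1")
```

Streaming each file through SHA-256 in chunks keeps memory use flat on large evidence sets, and retaining only the first path per digest is what removes the recurrent files that tier-1 targets.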
_version_ | 1797902217333702656 |
---|---|
author | D. Paul Joseph; P. Viswanathan |
author_facet | D. Paul Joseph; P. Viswanathan |
author_sort | D. Paul Joseph |
collection | DOAJ |
description | Most traditional digital forensic techniques identify irrelevant files in a corpus using keyword search, frequent-hash, frequent-path, and frequent-size methods. These methods are based on Message Digest and Secure Hash Algorithm-1, which are prone to hash collisions. Thresholding files by frequent sizes leads to imprecise threshold values and therefore to the evaluation of more irrelevant files. The blacklisted keywords used in forensic search are literal and non-lexical, which inflates false-positive search results and fails to disambiguate unstructured text. As a result, many extraneous files are carried forward for investigation, exacerbating the time lag. Moreover, the lack of standardized, labeled forensic data leads to $O(2^{n})$ time complexity during file classification. This research proposes a three-tier Keyword-Metadata-Pattern framework to overcome these concerns. Initially, a Secure Hash Algorithm-256 hash is constructed for the entire corpus, together with a custom regex and stop-word module, to avoid hash collisions and imprecise threshold values and to eliminate recurrent files. Then blacklisted keywords are constructed by identifying vectorized words in close proximity, overcoming the drawbacks of traditional keyword search and reducing false-positive results. Dynamic forensically relevant patterns derived from massive password datasets are designed to search for unique, relevant patterns, identify significant files, and reduce the time lag. Based on the tier-2 results, files are preliminarily classified automatically in O(log n) complexity, and the system is trained with a machine learning model. Finally, when evaluated experimentally, the overall proposed system was found to be very effective, outperforming the existing two-tier model at finding relevant files through automated labeling and classification in O(n log n) complexity. The proposed model eliminated 223K irrelevant files and reduced the corpus by 4.1% in tier-1, identified 16.06% of sensitive files in tier-2, and classified files with 91% precision, 95% sensitivity, 91% accuracy, and 0.11% Hamming loss compared to the two-tier system. (Illustrative sketches of the tier-1 and tier-2 steps appear after the record fields below.) |
first_indexed | 2024-04-10T09:14:13Z |
format | Article |
id | doaj.art-53daa62d25aa42d38f73c63630dd4988 |
institution | Directory Open Access Journal |
issn | 2169-3536 |
language | English |
last_indexed | 2024-04-10T09:14:13Z |
publishDate | 2023-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Access |
spelling | doaj.art-53daa62d25aa42d38f73c63630dd4988 (2023-02-21T00:01:55Z); eng; IEEE; IEEE Access; ISSN 2169-3536; 2023-01-01; vol. 11, pp. 3291-3306; DOI 10.1109/ACCESS.2023.3234434; IEEE document 10006815; SDOT: Secure Hash, Semantic Keyword Extraction, and Dynamic Operator Pattern-Based Three-Tier Forensic Classification Framework; D. Paul Joseph (https://orcid.org/0000-0003-2897-212X), School of Information Technology and Engineering, Vellore Institute of Technology, Vellore, India; P. Viswanathan, School of Computer Science and Engineering, Vellore Institute of Technology, Vellore, India; abstract as in the description field above; https://ieeexplore.ieee.org/document/10006815/; keywords: Digital forensics, disc forensics, forensic data classification, metadata, pattern, blacklisted keywords |
spellingShingle | D. Paul Joseph; P. Viswanathan; SDOT: Secure Hash, Semantic Keyword Extraction, and Dynamic Operator Pattern-Based Three-Tier Forensic Classification Framework; IEEE Access; Digital forensics; disc forensics; forensic data classification; metadata; pattern; blacklisted keywords |
title | SDOT: Secure Hash, Semantic Keyword Extraction, and Dynamic Operator Pattern-Based Three-Tier Forensic Classification Framework |
title_full | SDOT: Secure Hash, Semantic Keyword Extraction, and Dynamic Operator Pattern-Based Three-Tier Forensic Classification Framework |
title_fullStr | SDOT: Secure Hash, Semantic Keyword Extraction, and Dynamic Operator Pattern-Based Three-Tier Forensic Classification Framework |
title_full_unstemmed | SDOT: Secure Hash, Semantic Keyword Extraction, and Dynamic Operator Pattern-Based Three-Tier Forensic Classification Framework |
title_short | SDOT: Secure Hash, Semantic Keyword Extraction, and Dynamic Operator Pattern-Based Three-Tier Forensic Classification Framework |
title_sort | sdot secure hash semantic keyword extraction and dynamic operator pattern based three tier forensic classification framework |
topic | Digital forensics; disc forensics; forensic data classification; metadata; pattern; blacklisted keywords |
url | https://ieeexplore.ieee.org/document/10006815/ |
work_keys_str_mv | AT dpauljoseph sdotsecurehashsemantickeywordextractionanddynamicoperatorpatternbasedthreetierforensicclassificationframework AT pviswanathan sdotsecurehashsemantickeywordextractionanddynamicoperatorpatternbasedthreetierforensicclassificationframework |
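The description field also outlines a tier-2 stage in which the keyword blacklist is expanded by vector proximity, plus dynamic, password-derived patterns used to flag significant files. As a rough illustration only: the toy embeddings, the 0.95 similarity threshold, and the credential regex below are hypothetical stand-ins; the record does not disclose the vector model or the patterns the authors actually used.

```python
import math
import re

# Toy embeddings standing in for whatever vector model the authors used;
# a real deployment would load pretrained vectors (the record does not say which).
TOY_VECTORS = {
    "password":   [0.90, 0.10, 0.30],
    "passphrase": [0.88, 0.12, 0.28],
    "credential": [0.85, 0.20, 0.35],
    "invoice":    [0.10, 0.90, 0.20],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def expand_blacklist(seed_terms, vocab=TOY_VECTORS, threshold=0.95):
    """Tier-2 idea: add vocabulary words whose vectors sit close to a seed
    blacklist term, so lexical variants are caught without literal matching."""
    expanded = set(seed_terms)
    for seed in seed_terms:
        if seed not in vocab:
            continue
        for word, vec in vocab.items():
            if word != seed and cosine(vocab[seed], vec) >= threshold:
                expanded.add(word)
    return expanded

# Hypothetical credential-like pattern (the SDOT patterns are not published):
# a token of 8+ word characters mixing letters and digits, the shape of many
# entries in leaked-password datasets.
CREDENTIAL_PATTERN = re.compile(r"\b(?=\w*\d)(?=\w*[A-Za-z])\w{8,}\b")

def flag_sensitive(text, blacklist):
    """Mark text as forensically relevant if it contains a blacklisted keyword
    or a credential-shaped token."""
    lowered = text.lower()
    if any(term in lowered for term in blacklist):
        return True
    return bool(CREDENTIAL_PATTERN.search(text))

if __name__ == "__main__":
    blacklist = expand_blacklist({"password"})
    print(blacklist)  # "passphrase" and "credential" join via vector proximity
    print(flag_sensitive("backup key: Qwerty2023", blacklist))  # True
```

In this sketch, proximity in vector space is what lets a seed term such as "password" pull in near neighbours like "passphrase" without literal keyword matching, which is the behaviour the abstract credits with reducing false positives in tier-2.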