A Near Real-Time Big Data Provenance Generation Method Based on the Conjoint Analysis of Heterogeneous Logs

Data provenance is an effective approach to data security supervision. In a distributed, multi-user, multi-layer big data system, only a provenance generation method that leverages information logged at both the application and operating system levels can completely obtain...

Bibliographic Details
Main Authors: Yuanzhao Gao, Xingyuan Chen, Binglong Li, Xuehui Du
Format: Article
Language: English
Published: IEEE 2023-01-01
Series: IEEE Access
Subjects: Big data provenance; provenance generation; multi-log conjoint analysis; Hadoop
Online Access: https://ieeexplore.ieee.org/document/10198446/
author Yuanzhao Gao
Xingyuan Chen
Binglong Li
Xuehui Du
collection DOAJ
description Data provenance is an effective approach to data security supervision. In a distributed, multi-user, multi-layer big data system, only a provenance generation method that leverages information logged at both the application and operating system levels can completely obtain the provenance information required for data usage supervision. However, current research on the conjoint analysis of multiple logs is inadequate, and existing methods have difficulty effectively integrating the provenance information extracted from different logs, especially in big data scenarios. To achieve near real-time provenance generation based on the analysis of multiple heterogeneous logs, this paper takes a Hadoop-based big data system as the research object and proposes a parallel log analysis method based on auxiliary data structures and multi-threading. For efficient conjoint analysis of multiple logs, five auxiliary data structures are constructed as the medium for correlating and fusing log information, and a multi-threading method is presented to parallelize the lookup of provenance information. To cope with complex log record generation rules, log analysis methods for nondeterministic records, non-instantaneous operations, and instantaneous batch operations are proposed so that provenance information is generated correctly. In addition, a provenance generation framework is established to implement the proposed log analysis method. Experimental results show that the log collection time overhead caused by processing files above the MB level is less than 0.1%, and that the proposed method can analyze logs in near real time and generate provenance information correctly.
first_indexed 2024-03-12T16:58:17Z
format Article
id doaj.art-f2dc6d0dd17a4602906753b0c32982cb
institution Directory Open Access Journal
issn 2169-3536
language English
last_indexed 2024-03-12T16:58:17Z
publishDate 2023-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj.art-f2dc6d0dd17a4602906753b0c32982cb (2023-08-07T23:00:33Z)
volume 11
pages 80806-80821
doi 10.1109/ACCESS.2023.3300844
orcid Yuanzhao Gao https://orcid.org/0000-0003-0060-0623
Xingyuan Chen https://orcid.org/0000-0002-9061-6524
affiliation Zhengzhou Science and Technology Institute, Zhengzhou, China (all four authors)
title A Near Real-Time Big Data Provenance Generation Method Based on the Conjoint Analysis of Heterogeneous Logs
topic Big data provenance
provenance generation
multi-log conjoint analysis
Hadoop
url https://ieeexplore.ieee.org/document/10198446/