A Near Real-Time Big Data Provenance Generation Method Based on the Conjoint Analysis of Heterogeneous Logs
Data provenance is an effective approach to data security supervision. In a distributed, multi-user, multi-layer big data system, only a provenance generation method that leverages information logged at both the application and operating system levels can completely obtain...
Main Authors: | Yuanzhao Gao, Xingyuan Chen, Binglong Li, Xuehui Du |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE 2023-01-01 |
Series: | IEEE Access |
Subjects: | Big data provenance; provenance generation; multi-log conjoint analysis; Hadoop |
Online Access: | https://ieeexplore.ieee.org/document/10198446/ |
_version_ | 1797752065198391296 |
---|---|
author | Yuanzhao Gao, Xingyuan Chen, Binglong Li, Xuehui Du |
author_facet | Yuanzhao Gao, Xingyuan Chen, Binglong Li, Xuehui Du |
author_sort | Yuanzhao Gao |
collection | DOAJ |
description | Data provenance is an effective approach to data security supervision. In a distributed, multi-user, multi-layer big data system, only a provenance generation method that leverages information logged at both the application and operating system levels can completely obtain the provenance information required for data usage supervision. However, current research on the conjoint analysis of multiple logs is inadequate, and existing methods struggle to integrate the provenance information extracted from different logs, especially in big data scenarios. To achieve near real-time provenance generation from multiple heterogeneous logs, this paper takes a Hadoop-based big data system as the research object and proposes a parallel log analysis method based on auxiliary data structures and multi-threading. For efficient conjoint analysis of multiple logs, five auxiliary data structures are constructed as the medium for correlating and fusing log information, and a multi-threading method is presented to parallelize the lookup of provenance information. To cope with complex log record generation rules, log analysis methods for nondeterministic records, non-instantaneous operations, and instantaneous batch operations are proposed so that provenance information is generated correctly. In addition, a provenance generation framework is established to implement the proposed log analysis method. Experimental results show that the log collection time overhead caused by processing files above the MB level is less than 0.1%. The proposed method can analyze logs in near real time and generate provenance information correctly. |
first_indexed | 2024-03-12T16:58:17Z |
format | Article |
id | doaj.art-f2dc6d0dd17a4602906753b0c32982cb |
institution | Directory of Open Access Journals |
issn | 2169-3536 |
language | English |
last_indexed | 2024-03-12T16:58:17Z |
publishDate | 2023-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Access |
spelling | doaj.art-f2dc6d0dd17a4602906753b0c32982cb; 2023-08-07T23:00:33Z; eng; IEEE; IEEE Access; ISSN 2169-3536; 2023-01-01; vol. 11, pp. 80806-80821; DOI 10.1109/ACCESS.2023.3300844; 10198446; A Near Real-Time Big Data Provenance Generation Method Based on the Conjoint Analysis of Heterogeneous Logs; Yuanzhao Gao (https://orcid.org/0000-0003-0060-0623), Xingyuan Chen (https://orcid.org/0000-0002-9061-6524), Binglong Li, Xuehui Du; Zhengzhou Science and Technology Institute, Zhengzhou, China (all four authors); https://ieeexplore.ieee.org/document/10198446/; Big data provenance, provenance generation, multi-log conjoint analysis, Hadoop |
title | A Near Real-Time Big Data Provenance Generation Method Based on the Conjoint Analysis of Heterogeneous Logs |
topic | Big data provenance; provenance generation; multi-log conjoint analysis; Hadoop |
url | https://ieeexplore.ieee.org/document/10198446/ |
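
The description above mentions auxiliary data structures that correlate information from different logs and a multi-threaded lookup of provenance information. The following is a minimal sketch of that general idea, not the authors' implementation. Every name in it (`AppRecord`, `OsRecord`, `build_index`, `correlate`, `generate_provenance`), the record fields, and the 5-second matching window are assumptions made for illustration. It also uses a single path-keyed index rather than the five structures the paper describes.

```python
from bisect import bisect_right
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AppRecord:
    """Hypothetical application-level log record (who did what to which path)."""
    time: float
    user: str
    op: str
    path: str


@dataclass(frozen=True)
class OsRecord:
    """Hypothetical OS-level log record (which process touched which path)."""
    time: float
    pid: int
    syscall: str
    path: str


def build_index(app_log):
    """Auxiliary structure: path -> (sorted timestamps, records in the same order)."""
    grouped = defaultdict(list)
    for rec in app_log:
        grouped[rec.path].append(rec)
    index = {}
    for path, recs in grouped.items():
        recs.sort(key=lambda r: r.time)
        index[path] = ([r.time for r in recs], recs)
    return index


def correlate(index, os_rec, window=5.0) -> Optional[AppRecord]:
    """Return the latest application record on the same path that happened
    at or before the OS event and no more than `window` seconds earlier."""
    entry = index.get(os_rec.path)
    if entry is None:
        return None
    times, recs = entry
    pos = bisect_right(times, os_rec.time)  # binary search on timestamps
    if pos == 0:
        return None
    candidate = recs[pos - 1]
    return candidate if os_rec.time - candidate.time <= window else None


def generate_provenance(app_log, os_log, workers=4):
    """Fuse the two logs into provenance entries, looking records up in parallel."""
    index = build_index(app_log)  # read-only after construction, so safe to share
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so matches line up with os_log
        matches = list(pool.map(lambda r: correlate(index, r), os_log))
    return [
        {"user": a.user, "op": a.op, "path": o.path, "pid": o.pid, "syscall": o.syscall}
        for o, a in zip(os_log, matches)
        if a is not None
    ]
```

Because the index is built once and only read afterwards, the worker threads need no locking. In CPython the threads mainly help when lookups wait on I/O, so a real deployment might prefer processes or a different runtime for CPU-bound analysis.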