Privacy-Preserving Machine Learning on Apache Spark

The adoption of third-party machine learning (ML) cloud services is highly dependent on the security guarantees and the performance penalty they incur on workloads for model training and inference. This paper explores security/performance trade-offs for the distributed Apache Spark framework and its ML library. Concretely, we build upon a key insight: in specific deployment settings, one can reveal carefully chosen non-sensitive operations (e.g. statistical calculations). This allows us to considerably improve the performance of privacy-preserving solutions without exposing the protocol to pervasive ML attacks. In more detail, we propose Soteria, a system for distributed privacy-preserving ML that leverages Trusted Execution Environments (e.g. Intel SGX) to run computations over sensitive information in isolated containers (enclaves). Unlike previous work, where all ML-related computation is performed at trusted enclaves, we introduce a hybrid scheme, combining computation done inside and outside these enclaves. The experimental evaluation validates that our approach reduces the runtime of ML algorithms by up to 41% when compared to previous related work. Our protocol is accompanied by a security proof and a discussion regarding resilience against a wide spectrum of ML attacks.


Bibliographic Details
Main Authors: Claudia V. Brito, Pedro G. Ferreira, Bernardo L. Portela, Rui C. Oliveira, Joao T. Paulo
Format: Article
Language: English
Published: IEEE, 2023-01-01
Series: IEEE Access
Subjects: Privacy-preserving; machine learning; distributed systems; Apache Spark; trusted execution environments; Intel SGX
Online Access: https://ieeexplore.ieee.org/document/10314994/
collection DOAJ
description The adoption of third-party machine learning (ML) cloud services is highly dependent on the security guarantees and the performance penalty they incur on workloads for model training and inference. This paper explores security/performance trade-offs for the distributed Apache Spark framework and its ML library. Concretely, we build upon a key insight: in specific deployment settings, one can reveal carefully chosen non-sensitive operations (e.g. statistical calculations). This allows us to considerably improve the performance of privacy-preserving solutions without exposing the protocol to pervasive ML attacks. In more detail, we propose Soteria, a system for distributed privacy-preserving ML that leverages Trusted Execution Environments (e.g. Intel SGX) to run computations over sensitive information in isolated containers (enclaves). Unlike previous work, where all ML-related computation is performed at trusted enclaves, we introduce a hybrid scheme, combining computation done inside and outside these enclaves. The experimental evaluation validates that our approach reduces the runtime of ML algorithms by up to 41% when compared to previous related work. Our protocol is accompanied by a security proof and a discussion regarding resilience against a wide spectrum of ML attacks.
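The hybrid scheme described above rests on partitioning a pipeline into operations that must stay inside the enclave and non-sensitive operations (e.g. statistical calculations) that may be revealed and run outside it. Soteria's actual implementation is not reproduced here; the following sketch is only a loose illustration of that partitioning idea, with all names (`NON_SENSITIVE_OPS`, `partition_plan`, the operation labels) hypothetical rather than part of Soteria's or Spark's API.

```python
# Illustrative sketch of the hybrid enclave/plaintext partitioning idea:
# operations assumed safe to reveal run outside the enclave, while
# operations touching raw sensitive data are kept inside. All names are
# hypothetical and chosen for the example only.

# Operations we assume may be revealed without exposing sensitive data.
NON_SENSITIVE_OPS = {"count", "mean", "variance", "histogram"}

def partition_plan(operations):
    """Split a pipeline plan into enclave-bound and plaintext stages."""
    inside, outside = [], []
    for op in operations:
        (outside if op in NON_SENSITIVE_OPS else inside).append(op)
    return inside, outside

# Example pipeline mixing sensitive model steps with plain statistics.
plan = ["load_records", "mean", "gradient_step", "variance", "update_model"]
enclave_ops, plain_ops = partition_plan(plan)
print(enclave_ops)  # ['load_records', 'gradient_step', 'update_model']
print(plain_ops)    # ['mean', 'variance']
```

The point of the split is the performance claim in the abstract: every operation moved out of the enclave avoids enclave transition and memory-encryption overheads, which is how the hybrid scheme achieves its reported runtime reduction relative to all-in-enclave designs.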
first_indexed 2024-03-10T14:12:20Z
format Article
id doaj.art-0e6c726e85be44e69198d2b860a9bc92
institution Directory Open Access Journal
issn 2169-3536
language English
last_indexed 2024-03-10T14:12:20Z
publishDate 2023-01-01
publisher IEEE
record_format Article
series IEEE Access
doi 10.1109/ACCESS.2023.3332222
volume 11
pages 127907-127930
orcid Claudia V. Brito: https://orcid.org/0000-0003-4293-9887
orcid Pedro G. Ferreira: https://orcid.org/0000-0003-3838-8664
orcid Bernardo L. Portela: https://orcid.org/0000-0002-7203-2621
orcid Joao T. Paulo: https://orcid.org/0000-0001-9752-2822
affiliation INESC TEC, Porto, Portugal (all authors)
topic Privacy-preserving
machine learning
distributed systems
apache spark
trusted execution environments
Intel SGX