Privacy-Preserving Machine Learning on Apache Spark

The adoption of third-party machine learning (ML) cloud services is highly dependent on the security guarantees and the performance penalty they incur on workloads for model training and inference. This paper explores security/performance trade-offs for the distributed Apache Spark framework and its ML library. Concretely, we build upon a key insight: in specific deployment settings, one can reveal carefully chosen non-sensitive operations (e.g. statistical calculations). This allows us to considerably improve the performance of privacy-preserving solutions without exposing the protocol to pervasive ML attacks. In more detail, we propose Soteria, a system for distributed privacy-preserving ML that leverages Trusted Execution Environments (e.g. Intel SGX) to run computations over sensitive information in isolated containers (enclaves). Unlike previous work, where all ML-related computation is performed at trusted enclaves, we introduce a hybrid scheme, combining computation done inside and outside these enclaves. The experimental evaluation validates that our approach reduces the runtime of ML algorithms by up to 41% when compared to previous related work. Our protocol is accompanied by a security proof and a discussion regarding resilience against a wide spectrum of ML attacks.


Bibliographic Details
Main Authors: Claudia V. Brito, Pedro G. Ferreira, Bernardo L. Portela, Rui C. Oliveira, Joao T. Paulo
Format: Article
Language: English
Published: IEEE, 2023-01-01
Series: IEEE Access
Subjects: Privacy-preserving; machine learning; distributed systems; Apache Spark; trusted execution environments; Intel SGX
Online Access: https://ieeexplore.ieee.org/document/10314994/
collection DOAJ
description The adoption of third-party machine learning (ML) cloud services is highly dependent on the security guarantees and the performance penalty they incur on workloads for model training and inference. This paper explores security/performance trade-offs for the distributed Apache Spark framework and its ML library. Concretely, we build upon a key insight: in specific deployment settings, one can reveal carefully chosen non-sensitive operations (e.g. statistical calculations). This allows us to considerably improve the performance of privacy-preserving solutions without exposing the protocol to pervasive ML attacks. In more detail, we propose Soteria, a system for distributed privacy-preserving ML that leverages Trusted Execution Environments (e.g. Intel SGX) to run computations over sensitive information in isolated containers (enclaves). Unlike previous work, where all ML-related computation is performed at trusted enclaves, we introduce a hybrid scheme, combining computation done inside and outside these enclaves. The experimental evaluation validates that our approach reduces the runtime of ML algorithms by up to 41% when compared to previous related work. Our protocol is accompanied by a security proof and a discussion regarding resilience against a wide spectrum of ML attacks.
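The hybrid scheme described above rests on partitioning a pipeline into operations that must stay inside the enclave and non-sensitive operations (e.g. statistical calculations) that may be revealed and run outside it. Soteria's actual implementation is not reproduced here; the following sketch is only a loose illustration of that partitioning idea, with all names (`NON_SENSITIVE_OPS`, `partition_plan`, the operation labels) hypothetical rather than part of Soteria's or Spark's API.

```python
# Illustrative sketch of the hybrid enclave/plaintext partitioning idea:
# operations assumed safe to reveal run outside the enclave, while
# operations touching raw sensitive data are kept inside. All names are
# hypothetical and chosen for the example only.

# Operations we assume may be revealed without exposing sensitive data.
NON_SENSITIVE_OPS = {"count", "mean", "variance", "histogram"}

def partition_plan(operations):
    """Split a pipeline plan into enclave-bound and plaintext stages."""
    inside, outside = [], []
    for op in operations:
        (outside if op in NON_SENSITIVE_OPS else inside).append(op)
    return inside, outside

# Example pipeline mixing sensitive model steps with plain statistics.
plan = ["load_records", "mean", "gradient_step", "variance", "update_model"]
enclave_ops, plain_ops = partition_plan(plan)
print(enclave_ops)  # ['load_records', 'gradient_step', 'update_model']
print(plain_ops)    # ['mean', 'variance']
```

The point of the split is the performance claim in the abstract: every operation moved out of the enclave avoids enclave transition and memory-encryption overheads, which is how the hybrid scheme achieves its reported runtime reduction relative to all-in-enclave designs.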
first_indexed 2024-03-10T14:12:20Z
format Article
id doaj.art-0e6c726e85be44e69198d2b860a9bc92
institution Directory Open Access Journal
issn 2169-3536
language English
last_indexed 2024-03-10T14:12:20Z
publishDate 2023-01-01
publisher IEEE
record_format Article
series IEEE Access
doi 10.1109/ACCESS.2023.3332222
volume 11
pages 127907-127930
orcid Claudia V. Brito: https://orcid.org/0000-0003-4293-9887
orcid Pedro G. Ferreira: https://orcid.org/0000-0003-3838-8664
orcid Bernardo L. Portela: https://orcid.org/0000-0002-7203-2621
orcid Joao T. Paulo: https://orcid.org/0000-0001-9752-2822
affiliation INESC TEC, Porto, Portugal (all authors)
topic Privacy-preserving
machine learning
distributed systems
apache spark
trusted execution environments
Intel SGX