Efficient and Portable Distribution Modeling for Large-Scale Scientific Data Processing with Data-Parallel Primitives

The use of distribution-based data representation to handle large-scale scientific datasets is a promising approach. Distribution-based approaches often transform a scientific dataset into many distributions, each of which is calculated from a small number of samples. Most of the proposed parallel a...

Bibliographic Details
Main Authors: Hao-Yi Yang, Zhi-Rong Lin, Ko-Chih Wang
Format: Article
Language: English
Published: MDPI AG 2021-09-01
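Series: Algorithms
Subjects: large-scale data processing; scientific dataset; distribution-based approach; parallel algorithm; data-parallel primitive
Online Access: https://www.mdpi.com/1999-4893/14/10/285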
author Hao-Yi Yang
Zhi-Rong Lin
Ko-Chih Wang
collection DOAJ
description The use of distribution-based data representation to handle large-scale scientific datasets is a promising approach. Distribution-based approaches often transform a scientific dataset into many distributions, each of which is calculated from a small number of samples. Most of the proposed parallel algorithms focus on modeling single distributions from many input samples efficiently, but these may not fit the large-scale scientific data processing scenario because they cannot utilize computing resources effectively. Histograms and the Gaussian Mixture Model (GMM) are the most popular distribution representations used to model scientific datasets. Therefore, we propose the use of multi-set histogram and GMM modeling algorithms for the scenario of large-scale scientific data processing. Our algorithms are developed by data-parallel primitives to achieve portability across different hardware architectures. We evaluate the performance of the proposed algorithms in detail and demonstrate use cases for scientific data processing.
first_indexed 2024-03-10T06:47:30Z
format Article
id doaj.art-a1a8a9a1665d44b98f886663075e6240
institution Directory of Open Access Journals
issn 1999-4893
language English
last_indexed 2024-03-10T06:47:30Z
publishDate 2021-09-01
publisher MDPI AG
record_format Article
series Algorithms
spelling doaj.art-a1a8a9a1665d44b98f886663075e6240
2023-11-22T17:08:20Z
eng
MDPI AG
Algorithms
1999-4893
2021-09-01
volume 14, issue 10, article 285
10.3390/a14100285
Hao-Yi Yang, Zhi-Rong Lin, Ko-Chih Wang
Department of Computer Science and Information Engineering, National Taiwan Normal University, Taipei 11677, Taiwan
title Efficient and Portable Distribution Modeling for Large-Scale Scientific Data Processing with Data-Parallel Primitives
topic large-scale data processing
scientific dataset
distribution-based approach
parallel algorithm
data-parallel primitive
url https://www.mdpi.com/1999-4893/14/10/285