Efficient and Portable Distribution Modeling for Large-Scale Scientific Data Processing with Data-Parallel Primitives
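The use of distribution-based data representation to handle large-scale scientific datasets is a promising approach. Distribution-based approaches often transform a scientific dataset into many distributions, each of which is calculated from a small number of samples. Most of the proposed parallel algorithms focus on modeling single distributions from many input samples efficiently, but these may not fit the large-scale scientific data processing scenario because they cannot utilize computing resources effectively. Histograms and the Gaussian Mixture Model (GMM) are the most popular distribution representations used to model scientific datasets. Therefore, we propose the use of multi-set histogram and GMM modeling algorithms for the scenario of large-scale scientific data processing. Our algorithms are developed by data-parallel primitives to achieve portability across different hardware architectures. We evaluate the performance of the proposed algorithms in detail and demonstrate use cases for scientific data processing.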
Main Authors: | Hao-Yi Yang, Zhi-Rong Lin, Ko-Chih Wang |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG 2021-09-01 |
Series: | Algorithms |
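Subjects: | large-scale data processing, scientific dataset, distribution-based approach, parallel algorithm, data-parallel primitive |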
Online Access: | https://www.mdpi.com/1999-4893/14/10/285 |
_version_ | 1797515587559096320 |
---|---|
author | Hao-Yi Yang Zhi-Rong Lin Ko-Chih Wang |
author_facet | Hao-Yi Yang Zhi-Rong Lin Ko-Chih Wang |
author_sort | Hao-Yi Yang |
collection | DOAJ |
description | The use of distribution-based data representation to handle large-scale scientific datasets is a promising approach. Distribution-based approaches often transform a scientific dataset into many distributions, each of which is calculated from a small number of samples. Most of the proposed parallel algorithms focus on modeling single distributions from many input samples efficiently, but these may not fit the large-scale scientific data processing scenario because they cannot utilize computing resources effectively. Histograms and the Gaussian Mixture Model (GMM) are the most popular distribution representations used to model scientific datasets. Therefore, we propose the use of multi-set histogram and GMM modeling algorithms for the scenario of large-scale scientific data processing. Our algorithms are developed by data-parallel primitives to achieve portability across different hardware architectures. We evaluate the performance of the proposed algorithms in detail and demonstrate use cases for scientific data processing. |
first_indexed | 2024-03-10T06:47:30Z |
format | Article |
id | doaj.art-a1a8a9a1665d44b98f886663075e6240 |
institution | Directory Open Access Journal |
issn | 1999-4893 |
language | English |
last_indexed | 2024-03-10T06:47:30Z |
publishDate | 2021-09-01 |
publisher | MDPI AG |
record_format | Article |
series | Algorithms |
spelling | doaj.art-a1a8a9a1665d44b98f886663075e6240; 2023-11-22T17:08:20Z; eng; MDPI AG; Algorithms; 1999-4893; 2021-09-01; 14; 10; 285; 10.3390/a14100285; Efficient and Portable Distribution Modeling for Large-Scale Scientific Data Processing with Data-Parallel Primitives; Hao-Yi Yang (0); Zhi-Rong Lin (1); Ko-Chih Wang (2); (0) Department of Computer Science and Information Engineering, National Taiwan Normal University, Taipei 11677, Taiwan; (1) Department of Computer Science and Information Engineering, National Taiwan Normal University, Taipei 11677, Taiwan; (2) Department of Computer Science and Information Engineering, National Taiwan Normal University, Taipei 11677, Taiwan; The use of distribution-based data representation to handle large-scale scientific datasets is a promising approach. Distribution-based approaches often transform a scientific dataset into many distributions, each of which is calculated from a small number of samples. Most of the proposed parallel algorithms focus on modeling single distributions from many input samples efficiently, but these may not fit the large-scale scientific data processing scenario because they cannot utilize computing resources effectively. Histograms and the Gaussian Mixture Model (GMM) are the most popular distribution representations used to model scientific datasets. Therefore, we propose the use of multi-set histogram and GMM modeling algorithms for the scenario of large-scale scientific data processing. Our algorithms are developed by data-parallel primitives to achieve portability across different hardware architectures. We evaluate the performance of the proposed algorithms in detail and demonstrate use cases for scientific data processing.; https://www.mdpi.com/1999-4893/14/10/285; large-scale data processing; scientific dataset; distribution-based approach; parallel algorithm; data-parallel primitive |
spellingShingle | Hao-Yi Yang Zhi-Rong Lin Ko-Chih Wang Efficient and Portable Distribution Modeling for Large-Scale Scientific Data Processing with Data-Parallel Primitives Algorithms large-scale data processing scientific dataset distribution-based approach parallel algorithm data-parallel primitive |
title | Efficient and Portable Distribution Modeling for Large-Scale Scientific Data Processing with Data-Parallel Primitives |
title_sort | efficient and portable distribution modeling for large scale scientific data processing with data parallel primitives |
topic | large-scale data processing scientific dataset distribution-based approach parallel algorithm data-parallel primitive |
url | https://www.mdpi.com/1999-4893/14/10/285 |
work_keys_str_mv | AT haoyiyang efficientandportabledistributionmodelingforlargescalescientificdataprocessingwithdataparallelprimitives AT zhironglin efficientandportabledistributionmodelingforlargescalescientificdataprocessingwithdataparallelprimitives AT kochihwang efficientandportabledistributionmodelingforlargescalescientificdataprocessingwithdataparallelprimitives |